17 Oct, 2020

1 commit

  • Fix multiple occurrences of duplicated words in kernel/.

    Fix one typo/spello on the same line as a duplicate word. Change one
    instance of "the the" to "that the". Otherwise just drop one of the
    repeated words.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/98202fa6-8919-ef63-9efe-c0fad5ca7af1@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

09 May, 2020

1 commit

  • Add a simple struct nsset. It holds all necessary pieces to switch to a new
    set of namespaces without leaving a task in a half-switched state which we
    will make use of in the next patch. This patch switches the existing setns
    logic over without causing a change in setns() behavior. This brings
    setns() closer to how unshare() works. The prepare_ns() function is
    responsible for preparing all necessary information, for two reasons.
    First, it minimizes dependencies between individual namespaces, i.e. all
    install handlers can expect that all fields are properly initialized
    regardless of the order in which they are called. Second, it makes the code
    easier to maintain and easier to follow if it needs to be changed.

    The prepare_ns() helper will only be switched over to use a flags argument
    in the next patch. Here it will still use nstype as a simple integer
    argument, which it was argued would be clearer. I'm not particularly
    opinionated about whether this really helps or not. The struct nsset itself
    already contains the flags field since its name already indicates that it
    can contain information required by different namespaces. None of this
    should have functional consequences.
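
    The prepare-then-commit idea described above can be sketched in userspace
    C. This is only an illustration of the pattern: struct toy_task,
    struct toy_nsset, toy_prepare(), toy_install() and toy_commit() are
    hypothetical names and do not reflect the kernel's actual definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for namespace types; not kernel definitions. */
enum { TOY_NS_UTS, TOY_NS_NET, TOY_NS_COUNT };

struct toy_task {
	int ns[TOY_NS_COUNT];      /* namespaces the task currently uses */
};

struct toy_nsset {
	unsigned int flags;        /* bitmask of namespaces being switched */
	int ns[TOY_NS_COUNT];      /* fully prepared target state */
};

/* Start from a copy of the task's state so every field is initialized. */
static void toy_prepare(struct toy_nsset *set, const struct toy_task *task)
{
	set->flags = 0;
	for (size_t i = 0; i < TOY_NS_COUNT; i++)
		set->ns[i] = task->ns[i];
}

/* Stage one namespace in the set; on failure nothing is modified. */
static bool toy_install(struct toy_nsset *set, int type, int new_ns)
{
	if (type < 0 || type >= TOY_NS_COUNT || new_ns < 0)
		return false;
	set->ns[type] = new_ns;
	set->flags |= 1u << type;
	return true;
}

/* Called only after every install succeeded: the task switches in one step. */
static void toy_commit(struct toy_task *task, const struct toy_nsset *set)
{
	for (size_t i = 0; i < TOY_NS_COUNT; i++)
		task->ns[i] = set->ns[i];
}
```

    A caller would run toy_prepare(), then one toy_install() per requested
    namespace, and call toy_commit() only if all of them returned true, so
    the task is never observed half-switched.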

    Signed-off-by: Christian Brauner
    Reviewed-by: Serge Hallyn
    Cc: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Jann Horn
    Cc: Michael Kerrisk
    Cc: Aleksa Sarai
    Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com

    Christian Brauner
     

09 Jul, 2019

1 commit

  • …/git/dhowells/linux-fs

    Pull keyring namespacing from David Howells:
    "These patches help make keys and keyrings more namespace aware.

    Firstly some miscellaneous patches to make the process easier:

    - Simplify key index_key handling so that the word-sized chunks
    assoc_array requires don't have to be shifted about, making it
    easier to add more bits into the key.

    - Cache the hash value in the key so that we don't have to calculate
    on every key we examine during a search (it involves a bunch of
    multiplications).

    - Allow keyring_search() to search non-recursively.

    Then the main patches:

    - Make it so that keyring names are per-user_namespace from the point
    of view of KEYCTL_JOIN_SESSION_KEYRING so that they're not
    accessible cross-user_namespace.

    keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEYRING_NAME for this.

    - Move the user and user-session keyrings to the user_namespace
    rather than the user_struct. This prevents them propagating
    directly across user_namespace boundaries (i.e. the KEY_SPEC_*
    flags will only pick from the current user_namespace).

    - Make it possible to include the target namespace in which the key
    shall operate in the index_key. This will allow the possibility of
    multiple keys with the same description, but different target
    domains to be held in the same keyring.

    keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEY_TAG for this.

    - Make it so that keys are implicitly invalidated by removal of a
    domain tag, causing them to be garbage collected.

    - Institute a network namespace domain tag that allows keys to be
    differentiated by the network namespace in which they operate. New
    keys that are of a type marked 'KEY_TYPE_NET_DOMAIN' are assigned
    the network domain in force when they are created.

    - Make it so that the desired network namespace can be handed down
    into the request_key() mechanism. This allows AFS, NFS, etc. to
    request keys specific to the network namespace of the superblock.

    This also means that the keys in the DNS record cache are
    thenceforth namespaced, provided network filesystems pass the
    appropriate network namespace down into dns_query().

    For DNS, AFS and NFS are good, whilst CIFS and Ceph are not. Other
    cache keyrings, such as idmapper keyrings, also need to set the
    domain tag - for which they need access to the network namespace of
    the superblock"

    * tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    keys: Pass the network namespace into request_key mechanism
    keys: Network namespace domain tag
    keys: Garbage collect keys for which the domain has been removed
    keys: Include target namespace in match criteria
    keys: Move the user and user-session keyrings to the user_namespace
    keys: Namespace keyring names
    keys: Add a 'recurse' flag for keyring searches
    keys: Cache the hash value to avoid lots of recalculation
    keys: Simplify key description management

    Linus Torvalds
     

27 Jun, 2019

2 commits

  • Move the user and user-session keyrings to the user_namespace struct rather
    than pinning them from the user_struct struct. This prevents these
    keyrings from propagating across user-namespace boundaries with regard to
    the KEY_SPEC_* flags, thereby making them more useful in a containerised
    environment.

    The issue is that a single user_struct may represent UIDs in several
    different namespaces.

    The way the patch does this is by attaching a 'register keyring' in each
    user_namespace and then sticking the user and user-session keyrings into
    that. It can then be searched to retrieve them.

    Signed-off-by: David Howells
    cc: Jann Horn

    David Howells
     
  • Keyring names are held in a single global list that any process can pick
    from by means of keyctl_join_session_keyring (provided the keyring grants
    Search permission). This isn't very container friendly, however.

    Make the following changes:

    (1) Make default session, process and thread keyring names begin with a
    '.' instead of '_'.

    (2) Keyrings whose names begin with a '.' aren't added to the list. Such
    keyrings are system specials.

    (3) Replace the global list with per-user_namespace lists. A keyring adds
    its name to the list for the user_namespace that it is currently in.

    (4) When a user_namespace is deleted, it just removes itself from the
    keyring name list.

    The global keyring_name_lock is retained for accessing the name lists.
    This allows (4) to work.

    This can be tested by:

    # keyctl newring foo @s
    995906392
    # unshare -U
    $ keyctl show
    ...
    995906392 --alswrv 65534 65534 \_ keyring: foo
    ...
    $ keyctl session foo
    Joined session keyring: 935622349

    As can be seen, a new session keyring was created.

    The capability bit KEYCTL_CAPS1_NS_KEYRING_NAME is set if the kernel is
    employing this feature.

    Signed-off-by: David Howells
    cc: Eric W. Biederman

    David Howells
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation version 2 of the license

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 315 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Armijn Hemel
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

08 Nov, 2018

1 commit

  • The current logic first clones the extent array and sorts both copies, then
    maps the lower IDs of the forward mapping into the lower namespace, but
    doesn't map the lower IDs of the reverse mapping.

    This means that code in a nested user namespace with >5 extents will see
    incorrect IDs. It also breaks some access checks, like
    inode_owner_or_capable() and privileged_wrt_inode_uidgid(), so a process
    can incorrectly appear to be capable relative to an inode.

    To fix it, we have to make sure that the "lower_first" members of extents
    in both arrays are translated; and we have to make sure that the reverse
    map is sorted *after* the translation (since otherwise the translation can
    break the sorting).
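
    The ordering requirement can be illustrated with a small userspace
    sketch. It is not the kernel code: build_reverse() and toy_translate()
    are hypothetical helpers, and toy_translate() is a made-up parent
    mapping chosen so that translation changes the relative order of the
    extents.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Same three fields as the kernel's struct uid_gid_extent. */
struct extent {
	uint32_t first;        /* first ID inside the namespace */
	uint32_t lower_first;  /* first ID in the lower namespace */
	uint32_t count;
};

static int cmp_lower_first(const void *a, const void *b)
{
	const struct extent *x = a, *y = b;

	return (x->lower_first > y->lower_first) -
	       (x->lower_first < y->lower_first);
}

/* Made-up parent mapping that does not preserve ordering. */
static uint32_t toy_translate(uint32_t id)
{
	return id < 1000 ? id + 5000 : id - 1000;
}

/*
 * Translate lower_first in the forward array first, then derive the
 * reverse array from the translated values and sort it afterwards.
 */
static void build_reverse(struct extent *fwd, struct extent *rev, size_t n,
			  uint32_t (*translate)(uint32_t))
{
	for (size_t i = 0; i < n; i++)
		fwd[i].lower_first = translate(fwd[i].lower_first);
	memcpy(rev, fwd, n * sizeof(*fwd));
	qsort(rev, n, sizeof(*rev), cmp_lower_first);
}
```

    Sorting the reverse array before the translation would leave it ordered
    by the untranslated values, which a later binary search on
    "lower_first" could not rely on.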

    This is CVE-2018-18955.

    Fixes: 6397fac4915a ("userns: bump idmap limits to 340")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jann Horn
    Tested-by: Eric W. Biederman
    Reviewed-by: Eric W. Biederman
    Signed-off-by: Eric W. Biederman

    Jann Horn
     

11 Aug, 2018

1 commit

  • The old code would hold the userns_state_mutex indefinitely if
    memdup_user_nul stalled due to e.g. a userfault region. Prevent that by
    moving the memdup_user_nul in front of the mutex_lock().

    Note: This changes the error precedence of invalid buf/count/*ppos vs
    map already written / capabilities missing.
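
    A userspace analogue of the reordering, as a rough sketch: the copy that
    may stall is done before the lock is taken. write_state(), state and
    state_lock are hypothetical names, and a pthread mutex with malloc()
    stands in for the kernel's mutex and memdup_user_nul().

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static char *state;   /* write-once, protected by state_lock */

/* Returns 0 on success, -1 on allocation failure or if already written. */
static int write_state(const char *src, size_t len)
{
	/* The potentially slow copy happens first, with no lock held. */
	char *copy = malloc(len + 1);

	if (!copy)
		return -1;
	memcpy(copy, src, len);
	copy[len] = '\0';

	pthread_mutex_lock(&state_lock);
	if (state) {
		/* The "already written" check now comes after the copy. */
		pthread_mutex_unlock(&state_lock);
		free(copy);
		return -1;
	}
	state = copy;
	pthread_mutex_unlock(&state_lock);
	return 0;
}
```

    As in the note above, a bad input buffer is now reported before the
    "already written" condition is even looked at.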

    Fixes: 22d917d80e84 ("userns: Rework the user_namespace adding uid/gid...")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jann Horn
    Acked-by: Christian Brauner
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Jann Horn
     

13 Jun, 2018

1 commit

  • The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
    patch replaces cases of:

    kmalloc(a * b, gfp)

    with:

    kmalloc_array(a, b, gfp)

    as well as handling cases of:

    kmalloc(a * b * c, gfp)

    with:

    kmalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kmalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kmalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The tools/ directory was manually excluded, since it has its own
    implementation of kmalloc().
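
    The point of the two-factor form is that the multiplication is checked
    for overflow instead of silently wrapping. A minimal userspace sketch of
    that idea follows; alloc_array() and array3_bytes() are hypothetical
    analogues, not the kernel helpers, and __builtin_mul_overflow() is a
    GCC/Clang builtin assumed to be available.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Analogue of kmalloc_array(): NULL if n * size would overflow. */
static void *alloc_array(size_t n, size_t size)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;
	return malloc(bytes);
}

/* Analogue of array3_size(): saturates to SIZE_MAX on overflow. */
static size_t array3_bytes(size_t a, size_t b, size_t c)
{
	size_t bytes;

	if (__builtin_mul_overflow(a, b, &bytes) ||
	    __builtin_mul_overflow(bytes, c, &bytes))
		return SIZE_MAX;
	return bytes;
}
```

    With a saturated size, the allocator is asked for SIZE_MAX bytes and
    fails, rather than returning a buffer that is too small.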

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kmalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kmalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kmalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kmalloc
    + kmalloc_array
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kmalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kmalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kmalloc(sizeof(THING) * C2, ...)
    |
    kmalloc(sizeof(TYPE) * C2, ...)
    |
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(C1 * C2, ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

21 Mar, 2018

1 commit

  • Unprivileged users are normally restricted from mounting with the
    allow_other option by system policy, but this could be bypassed for a mount
    done with user namespace root permissions. In such cases allow_other should
    not allow users outside the userns to access the mount as doing so would
    give the unprivileged user the ability to manipulate processes it would
    otherwise be unable to manipulate. Restrict allow_other to apply to users
    in the same userns used at mount or a descendant of that namespace. Also
    export current_in_userns() for use by fuse when built as a module.

    Reviewed-by: Serge Hallyn
    Signed-off-by: Seth Forshee
    Signed-off-by: Dongsu Park
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Miklos Szeredi

    Seth Forshee
     

17 Nov, 2017

1 commit

  • Pull user namespace update from Eric Biederman:
    "The only change that is production ready this round is the work to
    increase the number of uid and gid mappings a user namespace can
    support from 5 to 340.

    This code was carefully benchmarked and it was confirmed that in the
    existing cases the performance remains the same. In the worst case
    with 340 mappings, cache-cold stat times go from 158ns to 248ns.
    That is noticeable but still quite small, and only the people who are
    doing crazy things pay the cost.

    This work uncovered some documentation and cleanup opportunities in
    the mapping code, and patches to make those cleanups and improve the
    documentation will be coming in the next merge window"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Simplify insert_extent
    userns: Make map_id_down a wrapper for map_id_range_down
    userns: Don't read extents twice in m_start
    userns: Simplify the user and group mapping functions
    userns: Don't special case a count of 0
    userns: bump idmap limits to 340
    userns: use union in {g,u}idmap struct

    Linus Torvalds
     

01 Nov, 2017

6 commits

  • Consolidate the code to write to the new mapping at the end of the
    function to remove the duplication. Move the increase in the number
    of mappings into insert_extent, keeping the logic together.

    Just a small increase in readability and maintainability.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • There is no good reason for this code duplication: the number of cache
    line accesses, not the number of instructions, is the bottleneck in
    this code.

    Therefore simplify maintenance by removing unnecessary code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • This is important so reading /proc/<pid>/{uid_map,gid_map,projid_map} while
    the map is being written does not do strange things.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Consolidate reading the number of extents and computing the return
    value in the map_id_down, map_id_range_down and map_id_range.

    This removal of one read of extents makes one smp_rmb unnecessary
    and makes the code safe if it is executed during the map write. Reading
    the number of extents twice and depending on the result being the same
    is not safe, as it could be 0 the first time and > 5 the second time,
    which would lead to misinterpreting the union fields.
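
    The single-read rule can be sketched as follows. This is an assumption-
    laden illustration: C11 atomics are swapped in for the kernel's memory
    barriers, and struct shared_map and extents_snapshot() are hypothetical
    names.

```c
#include <stdatomic.h>
#include <stdint.h>

#define BASE_EXTENTS 5

struct extent { uint32_t first, lower_first, count; };

struct shared_map {
	_Atomic uint32_t nr_extents;   /* published last by the writer */
	union {
		struct extent extent[BASE_EXTENTS];
		struct extent *forward;
	};
};

/* Pick the active union member from one snapshot of nr_extents. */
static const struct extent *extents_snapshot(struct shared_map *map,
					     uint32_t *n)
{
	/* One acquire load; the count is never re-read afterwards. */
	*n = atomic_load_explicit(&map->nr_extents, memory_order_acquire);
	return *n <= BASE_EXTENTS ? map->extent : map->forward;
}
```

    Every later decision uses the local snapshot *n, so a concurrent writer
    cannot make the reader pick one union member and count with another.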

    The consolidation of the return value just removes a duplicate
    calculation which should make it easier to understand and maintain
    the code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • We can always use a count of 1 so there is no reason to have
    a special case of a count of 0.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • There are quite some use cases where users run into the current limit for
    {g,u}id mappings. Consider a user requesting us to map everything but 999, and
    1001 for a given range of 1000000000 with a sub{g,u}id layout of:

    some-user:100000:1000000000
    some-user:999:1
    some-user:1000:1
    some-user:1001:1
    some-user:1002:1

    This translates to:

    MAPPING-TYPE | CONTAINER | HOST    | RANGE     |
    -------------|-----------|---------|-----------|
    uid          | 999       | 999     | 1         |
    uid          | 1001      | 1001    | 1         |
    uid          | 0         | 1000000 | 999       |
    uid          | 1000      | 1001000 | 1         |
    uid          | 1002      | 1001002 | 999998998 |
    ------------------------------------------------
    gid          | 999       | 999     | 1         |
    gid          | 1001      | 1001    | 1         |
    gid          | 0         | 1000000 | 999       |
    gid          | 1000      | 1001000 | 1         |
    gid          | 1002      | 1001002 | 999998998 |

    which is already the current limit.

    As discussed at LPC simply bumping the number of limits is not going to work
    since this would mean that struct uid_gid_map won't fit into a single cache-line
    anymore thereby regressing performance for the base-cases. The same problem
    seems to arise when using a single pointer. So the idea is to use

    struct uid_gid_extent {
    u32 first;
    u32 lower_first;
    u32 count;
    };

    struct uid_gid_map { /* 64 bytes -- 1 cache line */
    u32 nr_extents;
    union {
    struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
    struct {
    struct uid_gid_extent *forward;
    struct uid_gid_extent *reverse;
    };
    };
    };

    For the base cases we will only use the struct uid_gid_extent extent member. If
    we go over UID_GID_MAP_MAX_BASE_EXTENTS mappings we perform a single 4k
    kmalloc() which means we can have a maximum of 340 mappings
    (340 * sizeof(struct uid_gid_extent) = 4080). For the latter case we use two
    pointers "forward" and "reverse". The forward pointer points to an array sorted
    by "first" and the reverse pointer points to an array sorted by "lower_first".
    We can then perform binary search on those arrays.
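
    A minimal userspace sketch of the lookup this layout allows: a linear
    scan for the base case and bsearch() on the sorted forward array
    otherwise. struct id_map, cmp_id() and map_down() are illustrative
    names under the assumptions stated in the comments, not the kernel
    functions.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_BASE_EXTENTS 5   /* mirrors UID_GID_MAP_MAX_BASE_EXTENTS */

struct extent { uint32_t first, lower_first, count; };

struct id_map {
	uint32_t nr_extents;
	union {
		struct extent extent[MAX_BASE_EXTENTS];
		struct { struct extent *forward, *reverse; };
	};
};

/* 340 extents of 12 bytes each fit in a single 4096-byte allocation. */
_Static_assert(340 * sizeof(struct extent) == 4080, "extent is 12 bytes");

/* Is the key ID before, inside, or after the extent? */
static int cmp_id(const void *key, const void *elem)
{
	uint32_t id = *(const uint32_t *)key;
	const struct extent *e = elem;

	if (id < e->first)
		return -1;
	if (id - e->first >= e->count)
		return 1;
	return 0;
}

/* Map an ID down; returns UINT32_MAX when no extent covers it. */
static uint32_t map_down(const struct id_map *map, uint32_t id)
{
	const struct extent *e = NULL;

	if (map->nr_extents <= MAX_BASE_EXTENTS) {
		for (uint32_t i = 0; i < map->nr_extents; i++) {
			if (cmp_id(&id, &map->extent[i]) == 0) {
				e = &map->extent[i];
				break;
			}
		}
	} else {
		/* forward[] is assumed sorted by "first", non-overlapping. */
		e = bsearch(&id, map->forward, map->nr_extents,
			    sizeof(*e), cmp_id);
	}
	return e ? id - e->first + e->lower_first : UINT32_MAX;
}
```

    Mapping in the other direction would use the same approach on the
    reverse array, comparing against "lower_first" instead of "first".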

    Performance Testing:
    When Eric introduced the extent-based struct uid_gid_map approach he measured
    the performance impact of his idmap changes:

    > My benchmark consisted of going to single user mode where nothing else was
    > running. On an ext4 filesystem opening 1,000,000 files and looping through all
    > of the files 1000 times and calling fstat on the individuals files. This was
    > to ensure I was benchmarking stat times where the inodes were in the kernels
    > cache, but the inode values were not in the processors cache. My results:

    > v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
    > v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
    > v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)

    I used an identical approach on my laptop. Here's a thorough description of what
    I did. I built a 4.14.0-rc4 mainline kernel with my new idmap patches applied. I
    booted into single user mode and used an ext4 filesystem to open/create
    1,000,000 files. Then I looped through all of the files calling fstat() on each
    of them 1000 times and calculated the mean fstat() time for a single file. (The
    test program can be found below.)

    Here are the results. For fun, I compared the first version of my patch which
    scaled linearly with the new version of the patch:

    | # MAPPINGS | PATCH-V1 | PATCH-NEW |
    |--------------|------------|-----------|
    | 0 mappings | 158 ns | 158 ns |
    | 1 mappings | 164 ns | 157 ns |
    | 2 mappings | 170 ns | 158 ns |
    | 3 mappings | 175 ns | 161 ns |
    | 5 mappings | 187 ns | 165 ns |
    | 10 mappings | 218 ns | 199 ns |
    | 50 mappings | 528 ns | 218 ns |
    | 100 mappings | 980 ns | 229 ns |
    | 200 mappings | 1880 ns | 239 ns |
    | 300 mappings | 2760 ns | 240 ns |
    | 340 mappings | not tested | 248 ns |

    Here's the test program I used. I asked Eric what he did and this is a more
    "advanced" implementation of the idea. It's pretty straight-forward:

    #define _GNU_SOURCE
    #define __STDC_FORMAT_MACROS
    #include <fcntl.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        int ret;
        size_t i, k;
        int fd[1000000];
        int times[1000];
        char pathname[4096];
        struct stat st;
        struct timeval t1, t2;
        uint64_t time_in_mcs;
        uint64_t sum = 0;

        if (argc != 2) {
            fprintf(stderr, "Please specify a directory where to create "
                            "the test files\n");
            exit(EXIT_FAILURE);
        }

        for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
            sprintf(pathname, "%s/idmap_test_%zu", argv[1], i);
            fd[i] = open(pathname, O_RDWR | O_CREAT, S_IXUSR | S_IXGRP | S_IXOTH);
            if (fd[i] < 0) {
                ssize_t j;
                for (j = i; j >= 0; j--)
                    close(fd[j]);
                exit(EXIT_FAILURE);
            }
        }

        for (k = 0; k < 1000; k++) {
            ret = gettimeofday(&t1, NULL);
            if (ret < 0)
                goto close_all;

            for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
                ret = fstat(fd[i], &st);
                if (ret < 0)
                    goto close_all;
            }

            ret = gettimeofday(&t2, NULL);
            if (ret < 0)
                goto close_all;

            time_in_mcs = (1000000 * t2.tv_sec + t2.tv_usec) -
                          (1000000 * t1.tv_sec + t1.tv_usec);
            printf("Total time in micro seconds: %" PRIu64 "\n",
                   time_in_mcs);
            printf("Total time in nanoseconds: %" PRIu64 "\n",
                   time_in_mcs * 1000);
            printf("Time per file in nanoseconds: %" PRIu64 "\n",
                   (time_in_mcs * 1000) / 1000000);
            times[k] = (time_in_mcs * 1000) / 1000000;
        }

    close_all:
        for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++)
            close(fd[i]);

        if (ret < 0)
            exit(EXIT_FAILURE);

        for (k = 0; k < 1000; k++) {
            sum += times[k];
        }

        printf("Mean time per file in nanoseconds: %" PRIu64 "\n", sum / 1000);

        exit(EXIT_SUCCESS);
    }

    Signed-off-by: Christian Brauner
    CC: Serge Hallyn
    CC: Eric Biederman
    Signed-off-by: Eric W. Biederman

    Christian Brauner
     

25 Oct, 2017

1 commit

  • …READ_ONCE()/WRITE_ONCE()

    Please do not apply this to mainline directly, instead please re-run the
    coccinelle script shown below and apply its output.

    For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
    preference to ACCESS_ONCE(), and new code is expected to use one of the
    former. So far, there's been no reason to change most existing uses of
    ACCESS_ONCE(), as these aren't harmful, and changing them results in
    churn.

    However, for some features, the read/write distinction is critical to
    correct operation. To distinguish these cases, separate read/write
    accessors must be used. This patch migrates (most) remaining
    ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
    coccinelle script:

    ----
    // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
    // WRITE_ONCE()

    // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

    virtual patch

    @ depends on patch @
    expression E1, E2;
    @@

    - ACCESS_ONCE(E1) = E2
    + WRITE_ONCE(E1, E2)

    @ depends on patch @
    expression E;
    @@

    - ACCESS_ONCE(E)
    + READ_ONCE(E)
    ----
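
    For readers unfamiliar with the accessors, the effect of the conversion
    on a single variable looks roughly like this. The macros below are
    simplified userspace stand-ins written for this sketch (they rely on the
    GCC/Clang __typeof__ extension) and are not the kernel's definitions;
    flag, publish() and observe() are hypothetical.

```c
/* Simplified stand-ins; the kernel's macros handle more cases. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

static int flag;

static void publish(int v)
{
	WRITE_ONCE(flag, v);      /* was: ACCESS_ONCE(flag) = v; */
}

static int observe(void)
{
	return READ_ONCE(flag);   /* was: ACCESS_ONCE(flag) */
}
```

    The two rules in the script correspond to these two cases: an
    assignment through ACCESS_ONCE() becomes WRITE_ONCE(), and every other
    use becomes READ_ONCE().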

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: davem@davemloft.net
    Cc: linux-arch@vger.kernel.org
    Cc: mpe@ellerman.id.au
    Cc: shuah@kernel.org
    Cc: snitzer@redhat.com
    Cc: thor.thayer@linux.intel.com
    Cc: tj@kernel.org
    Cc: viro@zeniv.linux.org.uk
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Mark Rutland
     

20 Jul, 2017

1 commit

  • It is pointless and confusing to allow a pid namespace hierarchy and
    the user namespace hierarchy to get out of sync. The owner of a child
    pid namespace should be the owner of the parent pid namespace or
    a descendant of the owner of the parent pid namespace.

    Otherwise it is possible to construct scenarios where a process has a
    capability over a parent pid namespace but does not have the
    capability over a child pid namespace. Which confusingly makes
    permission checks non-transitive.

    It requires use of setns into a pid namespace (but not into a user
    namespace) to create such a scenario.

    Add the function in_userns to help in making this determination.

    v2: Optimized in_userns by using level as suggested
    by: Kirill Tkhai
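
    A sketch of how such a check can use the nesting level, assuming each
    namespace records its parent and its depth. struct userns here is a
    hypothetical, pared-down node, not the kernel's struct user_namespace.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, pared-down namespace node. */
struct userns {
	const struct userns *parent;  /* NULL for the initial namespace */
	int level;                    /* 0 for the initial namespace */
};

/* True if child is ancestor itself or one of its descendants. */
static bool in_userns(const struct userns *ancestor,
		      const struct userns *child)
{
	const struct userns *ns = child;

	/* Climb only while still deeper than the candidate ancestor. */
	while (ns->level > ancestor->level)
		ns = ns->parent;
	return ns == ancestor;
}
```

    Using the level bounds the walk: it stops as soon as the two namespaces
    are at the same depth, instead of always climbing to the root.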

    Ref: 49f4d8b93ccf ("pidns: Capture the user namespace and filter ns_last_pid")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

02 Mar, 2017

1 commit


23 Sep, 2016

4 commits

  • From: Andrey Vagin

    Each namespace has an owning user namespace, and currently there is no way
    to discover these relationships.

    Pid and user namespaces are hierarchical. There is no way to discover
    parent-child relationships either.

    Why might we want to know the relationships between namespaces?

    One use would be visualization, in order to understand the running
    system. Another would be to answer the question: what capability does
    process X have to perform operations on a resource governed by namespace
    Y?

    One more use case (which is usually called abnormal) is checkpoint/restart.
    In CRIU we are going to dump and restore nested namespaces.

    There was a discussion [1] about which interface to choose to determine
    relationships between namespaces.

    Eric suggested adding two ioctls [2]:
    > Grumble, Grumble. I think this may actually a case for creating ioctls
    > for these two cases. Now that random nsfs file descriptors are bind
    > mountable the original reason for using proc files is not as pressing.
    >
    > One ioctl for the user namespace that owns a file descriptor.
    > One ioctl for the parent namespace of a namespace file descriptor.

    Here is an implementation of these ioctls.

    $ man man7/namespaces.7
    ...
    Since Linux 4.X, the following ioctl(2) calls are supported for
    namespace file descriptors. The correct syntax is:

    fd = ioctl(ns_fd, ioctl_type);

    where ioctl_type is one of the following:

    NS_GET_USERNS
    Returns a file descriptor that refers to an owning user
    namespace.

    NS_GET_PARENT
    Returns a file descriptor that refers to a parent namespace.
    This ioctl(2) can be used for pid and user namespaces. For
    user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
    meaning.

    In addition to generic ioctl(2) errors, the following specific ones
    can occur:

    EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

    EPERM The requested namespace is outside of the current namespace
    scope.

    [1] https://lkml.org/lkml/2016/7/6/158
    [2] https://lkml.org/lkml/2016/7/9/101

    Changes for v2:
    * don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
    outside of the init namespace, so we can return EPERM in this case too.
    > The fewer special cases the easier the code is to get
    > correct, and the easier it is to read. // Eric

    Changes for v3:
    * rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Cc: "Eric W. Biederman"
    Cc: James Bottomley
    Cc: "Michael Kerrisk (man-pages)"
    Cc: "W. Trevor King"
    Cc: Alexander Viro
    Cc: Serge Hallyn

    Eric W. Biederman
     
  • Pid and user namespaces are hierarchical. There is no way to discover
    parent-child relationships.

    In the future we will use this interface to dump and restore nested
    namespaces.

    Acked-by: Serge Hallyn
    Signed-off-by: Andrei Vagin
    Signed-off-by: Eric W. Biederman

    Andrey Vagin
     
  • Return -EPERM if an owning user namespace is outside of a process's
    current user namespace.

    v2: In the first version ns_get_owner returned ENOENT for init_user_ns.
    This special case was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
    v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Acked-by: Serge Hallyn
    Signed-off-by: Andrei Vagin
    Signed-off-by: Eric W. Biederman

    Andrey Vagin
     
  • The current error codes returned when the per-user, per-user-namespace
    limits are hit (EINVAL, EUSERS, and ENFILE) are wrong. I
    asked for advice on linux-api and it was made clear that those were
    the wrong error codes, but a correct error code was not suggested.

    The best general error code I have found for hitting a resource limit
    is ENOSPC. It is not perfect but as it is unambiguous it will serve
    until someone comes up with a better error code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

09 Aug, 2016

4 commits


08 Aug, 2016

1 commit


24 Jun, 2016

1 commit

  • Capability sets attached to files must be ignored except in the
    user namespaces where the mounter is privileged, i.e. s_user_ns
    and its descendants. Otherwise a vector exists for gaining
    privileges in namespaces where a user is not already privileged.

    Add a new helper function, current_in_userns(), to test whether a user
    namespace is the same as or a descendant of another namespace.
    Use this helper to determine whether a file's capability set
    should be applied to the caps constructed during exec.

    --EWB Replaced in_userns with the simpler current_in_userns.
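
    The "same as or a descendant of" test can be modelled in user space as
    below. This is only a sketch: struct userns_model and in_userns_model()
    are hypothetical stand-ins, and the only assumption taken from the
    kernel is that each user namespace records its parent and its nesting
    level.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's user namespace object. */
struct userns_model {
	struct userns_model *parent;	/* NULL for the initial namespace */
	int level;			/* 0 for the initial namespace */
};

/* True if ns is target itself or one of target's descendants. */
static bool in_userns_model(const struct userns_model *ns,
			    const struct userns_model *target)
{
	/* Climb towards the root until ns is no deeper than target. */
	while (ns->level > target->level)
		ns = ns->parent;

	/* At equal (or shallower) depth only identity can match. */
	return ns == target;
}
```

    Using the level avoids walking all the way to the root when the target
    sits deep in the hierarchy.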

    Acked-by: Serge Hallyn
    Signed-off-by: Seth Forshee
    Signed-off-by: Eric W. Biederman

    Seth Forshee
     

04 Jan, 2016

1 commit


05 Sep, 2015

1 commit

  • Credit where credit is due: this idea comes from Christoph Lameter with
    a lot of valuable input from Serge Hallyn. This patch is heavily based
    on Christoph's patch.

    ===== The status quo =====

    On Linux, there are a number of capabilities defined by the kernel. To
    perform various privileged tasks, processes can wield capabilities that
    they hold.

    Each task has four capability masks: effective (pE), permitted (pP),
    inheritable (pI), and a bounding set (X). When the kernel checks for a
    capability, it checks pE. The other capability masks serve to modify
    what capabilities can be in pE.

    Any task can remove capabilities from pE, pP, or pI at any time. If a
    task has a capability in pP, it can add that capability to pE and/or pI.
    If a task has CAP_SETPCAP, then it can add any capability to pI, and it
    can remove capabilities from X.

    Tasks are not the only things that can have capabilities; files can also
    have capabilities. A file can have no capability information at all [1].
    If a file has capability information, then it has a permitted mask (fP)
    and an inheritable mask (fI) as well as a single effective bit (fE) [2].
    File capabilities modify the capabilities of tasks that execve(2) them.

    A task that successfully calls execve has its capabilities modified for
    the file ultimately being executed (i.e. the binary itself if that
    binary is ELF or for the interpreter if the binary is a script.) [3] In
    the capability evolution rules, for each mask Z, pZ represents the old
    value and pZ' represents the new value. The rules are:

    pP' = (X & fP) | (pI & fI)
    pI' = pI
    pE' = (fE ? pP' : 0)
    X is unchanged
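
    As an illustration only, the rules above can be written as a pure
    function over bitmasks. The type and function names (capmask_t,
    task_caps, file_caps, exec_caps_legacy) are invented for this sketch;
    the setuid and root adjustments described below are assumed to have
    been applied to fP, fI and fE by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t capmask_t;	/* one bit per capability (assumption) */

struct task_caps {
	capmask_t pP, pI, pE, X;
};

struct file_caps {
	capmask_t fP, fI;
	bool fE;		/* a single bit, not a mask */
};

/* Apply the pre-ambient execve() evolution rules to a task. */
static struct task_caps exec_caps_legacy(struct task_caps old,
					 struct file_caps f)
{
	struct task_caps next = old;

	next.pP = (old.X & f.fP) | (old.pI & f.fI);
	next.pI = old.pI;
	next.pE = f.fE ? next.pP : 0;
	/* X is unchanged */
	return next;
}
```

    With fI == 0, as for an ordinary binary run by a nonroot user, pI
    contributes nothing to pP'.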

    For setuid binaries, fP, fI, and fE are modified by a moderately
    complicated set of rules that emulate POSIX behavior. Similarly, if
    euid == 0 or ruid == 0, then fP, fI, and fE are modified differently
    (primarily, fP and fI usually end up being the full set). For nonroot
    users executing binaries with neither setuid nor file caps, fI and fP
    are empty and fE is false.

    As an extra complication, if you execute a process as nonroot and fE is
    set, then the "secure exec" rules are in effect: AT_SECURE gets set,
    LD_PRELOAD doesn't work, etc.

    This is rather messy. We've learned that making any changes is
    dangerous, though: if a new kernel version allows an unprivileged
    program to change its security state in a way that persists across
    execution of a setuid program or a program with file caps, this
    persistent state is surprisingly likely to allow setuid or file-capped
    programs to be exploited for privilege escalation.

    ===== The problem =====

    Capability inheritance is basically useless.

    If you aren't root and you execute an ordinary binary, fI is zero, so
    your capabilities have no effect whatsoever on pP'. This means that you
    can't usefully execute a helper process or a shell command with elevated
    capabilities if you aren't root.

    On current kernels, you can sort of work around this by setting fI to
    the full set for most or all non-setuid executable files. This causes
    pP' = pI for nonroot, and inheritance works. No one does this because
    it's a PITA and it isn't even supported on most filesystems.

    If you try this, you'll discover that every nonroot program ends up with
    secure exec rules, breaking many things.

    This is a problem that has bitten many people who have tried to use
    capabilities for anything useful.

    ===== The proposed change =====

    This patch adds a fifth capability mask called the ambient mask (pA).
    pA does what most people expect pI to do.

    pA obeys the invariant that no bit can ever be set in pA if it is not
    set in both pP and pI. Dropping a bit from pP or pI drops that bit from
    pA. This ensures that existing programs that try to drop capabilities
    still do so, with a complication. Because capability inheritance is so
    broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and
    then calling execve effectively drops capabilities. Therefore,
    setresuid from root to nonroot conditionally clears pA unless
    SECBIT_NO_SETUID_FIXUP is set. Processes that don't like this can
    re-add bits to pA afterwards.

    The capability evolution rules are changed:

    pA' = (file caps or setuid or setgid ? 0 : pA)
    pP' = (X & fP) | (pI & fI) | pA'
    pI' = pI
    pE' = (fE ? pP' : pA')
    X is unchanged
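
    A sketch of the changed rules and of the pA invariant, again as plain
    bitmask arithmetic. All names here (task_caps, exec_file,
    ambient_raise, exec_caps_ambient) are hypothetical, and the conditional
    clearing of pA on setresuid is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t capmask_t;	/* one bit per capability (assumption) */

struct task_caps {
	capmask_t pP, pI, pE, pA, X;
};

struct exec_file {
	capmask_t fP, fI;
	bool fE;
	bool has_file_caps;	/* file carries capability information */
	bool setuid_or_setgid;
};

/* Raise one bit in pA; refused unless the bit is set in both pP and pI. */
static bool ambient_raise(struct task_caps *t, int cap)
{
	capmask_t bit = (capmask_t)1 << cap;

	if (!(t->pP & t->pI & bit))
		return false;
	t->pA |= bit;
	return true;
}

/* Apply the execve() evolution rules including the ambient mask. */
static struct task_caps exec_caps_ambient(struct task_caps old,
					  struct exec_file f)
{
	struct task_caps next = old;

	next.pA = (f.has_file_caps || f.setuid_or_setgid) ? 0 : old.pA;
	next.pP = (old.X & f.fP) | (old.pI & f.fI) | next.pA;
	next.pI = old.pI;
	next.pE = f.fE ? next.pP : next.pA;
	/* X is unchanged */
	return next;
}
```

    For an ordinary binary the ambient bits survive into pP' and pE'; for a
    setuid, setgid or file-capped binary pA' is zero and the result matches
    the old rules.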

    If you are nonroot but you have a capability, you can add it to pA. If
    you do so, your children get that capability in pA, pP, and pE. For
    example, you can set pA = CAP_NET_BIND_SERVICE, and your children can
    automatically bind low-numbered ports. Hallelujah!

    Unprivileged users can create user namespaces, map themselves to a
    nonzero uid, and create both privileged (relative to their namespace)
    and unprivileged process trees. This is currently more or less
    impossible. Hallelujah!

    You cannot use pA to try to subvert a setuid, setgid, or file-capped
    program: if you execute any such program, pA gets cleared and the
    resulting evolution rules are unchanged by this patch.

    Users with nonzero pA are unlikely to unintentionally leak that
    capability. If they run programs that try to drop privileges, dropping
    privileges will still work.

    It's worth noting that the degree of paranoia in this patch could
    possibly be reduced without causing serious problems. Specifically, if
    we allowed pA to persist across executing non-pA-aware setuid binaries
    and across setresuid, then, naively, the only capabilities that could
    leak as a result would be the capabilities in pA, and any attacker
    *already* has those capabilities. This would make me nervous, though --
    setuid binaries that tried to privilege-separate might fail to do so,
    and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have
    unexpected side effects. (Whether these unexpected side effects would
    be exploitable is an open question.) I've therefore taken the more
    paranoid route. We can revisit this later.

    An alternative would be to require PR_SET_NO_NEW_PRIVS before setting
    ambient capabilities. I think that this would be annoying and would
    make granting otherwise unprivileged users minor ambient capabilities
    (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than
    it is with this patch.

    ===== Footnotes =====

    [1] Files that are missing the "security.capability" xattr or that have
    unrecognized values for that xattr end up with has_cap set to false.
    The code that does that appears to be complicated for no good reason.

    [2] The libcap capability mask parsers and formatters are dangerously
    misleading and the documentation is flat-out wrong. fE is *not* a mask;
    it's a single bit. This has probably confused every single person who
    has tried to use file capabilities.

    [3] Linux very confusingly processes both the script and the interpreter
    if applicable, for reasons that elude me. The results from thinking
    about a script's file capabilities and/or setuid bits are mostly
    discarded.

    Preliminary userspace code is here, but it needs updating:
    https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2

    Here is a test program that can be used to verify the functionality
    (from Christoph):

    /*
    * Test program for the ambient capabilities. This program spawns a shell
    * that allows running processes with a defined set of capabilities.
    *
    * (C) 2015 Christoph Lameter
    * Released under: GPL v3 or later.
    *
    *
    * Compile using:
    *
    * gcc -o ambient_test ambient_test.c -lcap-ng
    *
    * This program must have the following capabilities to run properly:
    * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
    *
    * A command to equip the binary with the right caps is:
    *
    * setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test
    *
    *
    * To get a shell with additional caps that can be inherited by other processes:
    *
    * ./ambient_test /bin/bash
    *
    *
    * Verifying that it works:
    *
    * From the bash spawned by ambient_test run
    *
    * cat /proc/$$/status
    *
    * and have a look at the capabilities.
    */

    #include <stdlib.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <cap-ng.h>
    #include <sys/prctl.h>
    #include <linux/capability.h>

    /*
    * Definitions from the kernel header files. These are going to be removed
    * when the /usr/include files have these defined.
    */
    #define PR_CAP_AMBIENT 47
    #define PR_CAP_AMBIENT_IS_SET 1
    #define PR_CAP_AMBIENT_RAISE 2
    #define PR_CAP_AMBIENT_LOWER 3
    #define PR_CAP_AMBIENT_CLEAR_ALL 4

    static void set_ambient_cap(int cap)
    {
        int rc;

        /* Load the current capability state of this process */
        capng_get_caps_process();
        rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
        if (rc) {
            printf("Cannot add inheritable cap\n");
            exit(2);
        }
        /* The bit must be in pI (and pP) before it can be raised in pA */
        capng_apply(CAPNG_SELECT_CAPS);

        /* Note the two 0s at the end. Kernel checks for these */
        if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
            perror("Cannot set cap");
            exit(1);
        }
    }

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "Usage: %s program [args...]\n", argv[0]);
            exit(1);
        }

        set_ambient_cap(CAP_NET_RAW);
        set_ambient_cap(CAP_NET_ADMIN);
        set_ambient_cap(CAP_SYS_NICE);

        printf("Ambient_test forking shell\n");
        if (execv(argv[1], argv + 1))
            perror("Cannot exec");

        /* Only reached if execv failed */
        return 1;
    }

    Signed-off-by: Christoph Lameter # Original author
    Signed-off-by: Andy Lutomirski
    Acked-by: Serge E. Hallyn
    Acked-by: Kees Cook
    Cc: Jonathan Corbet
    Cc: Aaron Jones
    Cc: Ted Ts'o
    Cc: Andrew G. Morgan
    Cc: Mimi Zohar
    Cc: Austin S Hemmelgarn
    Cc: Markku Savela
    Cc: Jarkko Sakkinen
    Cc: Michael Kerrisk
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

13 Aug, 2015

1 commit

  • The code that places signals in signal queues computes the uids, gids,
    and pids at the time the signals are enqueued. Which means that tasks
    that share signal queues must be in the same pid and user namespaces.

    Sharing signal handlers is fine, but bizarre.

    So make the code in fork and userns_install clearer by only testing
    for what is functionally necessary.

    Also update the comment in unshare about unsharing a user namespace to
    be a little more explicit and make a little more sense.
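
    The rule described above can be sketched as a flag check. This is a
    simplified model, not the code in fork: clone_flags_allowed_model() is
    a made-up name, and it covers only the "shared signal queue implies
    same pid and user namespace" requirement, ignoring every other
    restriction on clone flags.

```c
#include <stdbool.h>
#include <linux/sched.h>	/* CLONE_THREAD, CLONE_NEWUSER, CLONE_NEWPID */

/*
 * CLONE_THREAD puts the new task in the caller's thread group, which
 * shares a signal queue. Queued signals carry uids, gids and pids computed
 * at enqueue time, so such a task may not get a new pid or user namespace.
 * CLONE_SIGHAND alone (shared handlers) does not trigger the restriction.
 */
static bool clone_flags_allowed_model(unsigned long flags)
{
	if ((flags & CLONE_THREAD) &&
	    (flags & (CLONE_NEWUSER | CLONE_NEWPID)))
		return false;
	return true;
}
```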

    Acked-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

18 Dec, 2014

1 commit

  • Pull user namespace related fixes from Eric Biederman:
    "As these are bug fixes almost all of these changes are marked for
    backporting to stable.

    The first change (implicitly adding MNT_NODEV on remount) addresses a
    regression that was created when security issues with unprivileged
    remount were closed. I go on to update the remount test to make it
    easy to detect if this issue reoccurs.

    Then there are a handful of mount and umount related fixes.

    Then half of the changes deal with a recently discovered design
    bug in the permission checks of gid_map. Unix since the beginning has
    allowed setting group permissions on files to less than the user and
    other permissions (aka ---rwx---rwx). As the unix permission checks
    stop as soon as a group matches, and setgroups allows setting groups
    that can not later be dropped, this results in a situation where it is
    possible to legitimately use a group to assign fewer privileges to a
    process. Which means dropping a group can increase a process's
    privileges.

    The fix I have adopted is that gid_map is now no longer writable
    without privilege unless the new file /proc/self/setgroups has been
    set to permanently disable setgroups.

    The bulk of applications using user namespaces, even the applications
    using user namespaces without privilege, remain unaffected by this
    change. Unfortunately this fix breaks a couple of user space
    applications that were relying on the problematic behavior (one
    of which was tools/selftests/mount/unprivileged-remount-test.c).

    To hopefully prevent needing a regression fix on top of my security
    fix I rounded up folks who work with the container implementations most
    likely to be affected and encouraged them to test the changes.

    > So far nothing broke on my libvirt-lxc test bed. :-)
    > Tested with openSUSE 13.2 and libvirt 1.2.9.
    > Tested-by: Richard Weinberger

    > Tested on Fedora20 with libvirt 1.2.11, works fine.
    > Tested-by: Chen Hanxiao

    > Ok, thanks - yes, unprivileged lxc is working fine with your kernels.
    > Just to be sure I was testing the right thing I also tested using
    > my unprivileged nsexec testcases, and they failed on setgroup/setgid
    > as now expected, and succeeded there without your patches.
    > Tested-by: Serge Hallyn

    > I tested this with Sandstorm. It breaks as is and it works if I add
    > the setgroups thing.
    > Tested-by: Andy Lutomirski # breaks things as designed :("

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Unbreak the unprivileged remount tests
    userns; Correct the comment in map_write
    userns: Allow setting gid_maps without privilege when setgroups is disabled
    userns: Add a knob to disable setgroups on a per user namespace basis
    userns: Rename id_map_mutex to userns_state_mutex
    userns: Only allow the creator of the userns unprivileged mappings
    userns: Check euid no fsuid when establishing an unprivileged uid mapping
    userns: Don't allow unprivileged creation of gid mappings
    userns: Don't allow setgroups until a gid mapping has been setablished
    userns: Document what the invariant required for safe unprivileged mappings.
    groups: Consolidate the setgroups permission checks
    mnt: Clear mnt_expire during pivot_root
    mnt: Carefully set CL_UNPRIVILEGED in clone_mnt
    mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers.
    umount: Do not allow unmounting rootfs.
    umount: Disallow unprivileged mount force
    mnt: Update unprivileged remount test
    mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount

    Linus Torvalds
     

12 Dec, 2014

3 commits

  • It is important that all maps are less than PAGE_SIZE
    or else setting the last byte of the buffer to '0'
    could write off the end of the allocated storage.

    Correct the misleading comment.
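
    The bound in question can be illustrated with a small userspace model.
    MODEL_PAGE_SIZE and copy_map_input() are assumptions made for this
    sketch and do not appear in the patch; the point is only that the
    length check must be "count >= size" for the terminator to stay inside
    the buffer.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096	/* assumption: stand-in for PAGE_SIZE */

/*
 * Copy count bytes of map input into a page-sized buffer and terminate
 * it. A count equal to the buffer size must be rejected, otherwise the
 * terminator would be written one byte past the end.
 */
static int copy_map_input(char page[MODEL_PAGE_SIZE], const char *src,
			  size_t count)
{
	if (count >= MODEL_PAGE_SIZE)
		return -EINVAL;

	memcpy(page, src, count);
	page[count] = '\0';	/* count <= MODEL_PAGE_SIZE - 1, in bounds */
	return 0;
}
```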

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Now that setgroups can be disabled and not reenabled, setting gid_map
    without privilege can now be enabled when setgroups is disabled.

    This restores most of the functionality that was lost when unprivileged
    setting of gid_map was removed. Applications that use this functionality
    will need to check to see if they use setgroups or init_groups, and if they
    don't they can be fixed by simply disabling setgroups before writing to
    gid_map.

    Cc: stable@vger.kernel.org
    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • - Expose the knob to user space through a proc file /proc/<pid>/setgroups

    A value of "deny" means the setgroups system call is disabled in the
    current process's user namespace and can not be enabled in the
    future in this user namespace.

    A value of "allow" means the setgroups system call is enabled.

    - Descendant user namespaces inherit the value of setgroups from
    their parents.

    - A proc file is used (instead of a sysctl) as sysctls currently do
    not allow checking the permissions at open time.

    - Writing to the proc file is restricted to before the gid_map
    for the user namespace is set.

    This ensures that disabling setgroups at a user namespace
    level will never remove the ability to call setgroups
    from a process that already has that ability.

    A process may opt in to the setgroups disable for itself by
    creating, entering and configuring a user namespace or by calling
    setns on an existing user namespace with setgroups disabled.
    Processes without privileges already can not call setgroups so this
    is a noop. Processes with privilege become processes without
    privilege when entering a user namespace and as with any other path
    to dropping privilege they would not have the ability to call
    setgroups. So this remains within the bounds of what is possible
    without a knob to disable setgroups permanently in a user namespace.
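
    Reading the knob back is a plain file read. The sketch below assumes
    the file holds a single word, "allow" or "deny", as described above;
    setgroups_denied() is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Report the state of a setgroups proc file such as /proc/self/setgroups.
 * Returns 1 for "deny", 0 for "allow", -1 if the file cannot be read or
 * holds something else.
 */
static int setgroups_denied(const char *path)
{
	char buf[16] = "";
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	if (strncmp(buf, "deny", 4) == 0)
		return 1;
	if (strncmp(buf, "allow", 5) == 0)
		return 0;
	return -1;
}
```

    A return of -1 on a kernel that predates the knob simply means the file
    does not exist.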

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

10 Dec, 2014

3 commits

  • Generalize id_map_mutex so it can be used for more state of a user namespace.

    Cc: stable@vger.kernel.org
    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • If you did not create the user namespace and are allowed
    to write to uid_map or gid_map you should already have the necessary
    privilege in the parent user namespace to establish any mapping
    you want so this will not affect userspace in practice.

    Limiting unprivileged uid mapping establishment to the creator of the
    user namespace makes it easier to verify all credentials obtained with
    the uid mapping can be obtained without the uid mapping without
    privilege.

    Limiting unprivileged gid mapping establishment (which is temporarily
    absent) to the creator of the user namespace also ensures that the
    combination of uid and gid can already be obtained without privilege.

    This is part of the fix for CVE-2014-8989.

    Cc: stable@vger.kernel.org
    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • setresuid allows the euid to be set to any of uid, euid, suid, and
    fsuid. Therefore it is safe to allow an unprivileged user to map
    their euid and use CAP_SETUID privileged with exactly that uid,
    as no new credentials can be obtained.

    I can not find a combination of existing system calls that allows setting
    uid, euid, suid, and fsuid from the fsuid making the previous use
    of fsuid for allowing unprivileged mappings a bug.
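
    Taken together with the creator-only restriction, the condition for an
    unprivileged uid mapping can be modelled as below. This is a sketch of
    the rule as described in these messages, not the kernel function:
    id_extent_model and unpriv_uid_map_permitted() are invented names, and
    uids are treated as plain integers in the parent namespace.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical model of one "first lower_first count" line of uid_map. */
struct id_extent_model {
	unsigned int first;		/* first id inside the namespace */
	unsigned int lower_first;	/* first id in the parent namespace */
	unsigned int count;		/* number of ids mapped */
};

/*
 * An unprivileged writer may establish a mapping only if it is a single
 * extent of length one, the writer created the namespace (its euid is
 * the namespace owner), and the id mapped from the parent is exactly the
 * writer's euid. No new credentials are reachable that way.
 */
static bool unpriv_uid_map_permitted(const struct id_extent_model *ext,
				     size_t nr_extents,
				     uid_t ns_owner, uid_t writer_euid)
{
	return nr_extents == 1 &&
	       ext[0].count == 1 &&
	       ns_owner == writer_euid &&
	       ext[0].lower_first == writer_euid;
}
```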

    This is part of a fix for CVE-2014-8989.

    Cc: stable@vger.kernel.org
    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman