05 Jun, 2020

1 commit

  • Fix the following sparse warning:

    kernel/user.c:85:19: warning: symbol 'uidhash_table' was not declared.
    Should it be static?

    Reported-by: Hulk Robot
    Signed-off-by: Jason Yan
    Signed-off-by: Andrew Morton
    Cc: David Howells
    Cc: Greg Kroah-Hartman
    Cc: Rasmus Villemoes
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20200413082146.22737-1-yanaijie@huawei.com
    Signed-off-by: Linus Torvalds

    Jason Yan
     

09 Jul, 2019

1 commit

  • …/git/dhowells/linux-fs

    Pull keyring namespacing from David Howells:
    "These patches help make keys and keyrings more namespace aware.

    Firstly some miscellaneous patches to make the process easier:

    - Simplify key index_key handling so that the word-sized chunks
    assoc_array requires don't have to be shifted about, making it
    easier to add more bits into the key.

    - Cache the hash value in the key so that we don't have to calculate
    on every key we examine during a search (it involves a bunch of
    multiplications).

    - Allow keyring_search() to search non-recursively.

    Then the main patches:

    - Make it so that keyring names are per-user_namespace from the point
    of view of KEYCTL_JOIN_SESSION_KEYRING so that they're not
    accessible cross-user_namespace.

    keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEYRING_NAME for this.

    - Move the user and user-session keyrings to the user_namespace
    rather than the user_struct. This prevents them propagating
    directly across user_namespaces boundaries (ie. the KEY_SPEC_*
    flags will only pick from the current user_namespace).

    - Make it possible to include the target namespace in which the key
    shall operate in the index_key. This will allow the possibility of
    multiple keys with the same description, but different target
    domains to be held in the same keyring.

    keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEY_TAG for this.

    - Make it so that keys are implicitly invalidated by removal of a
    domain tag, causing them to be garbage collected.

    - Institute a network namespace domain tag that allows keys to be
    differentiated by the network namespace in which they operate. New
    keys that are of a type marked 'KEY_TYPE_NET_DOMAIN' are assigned
    the network domain in force when they are created.

    - Make it so that the desired network namespace can be handed down
    into the request_key() mechanism. This allows AFS, NFS, etc. to
    request keys specific to the network namespace of the superblock.

    This also means that the keys in the DNS record cache are
    thenceforth namespaced, provided network filesystems pass the
    appropriate network namespace down into dns_query().

    For DNS, AFS and NFS are good, whilst CIFS and Ceph are not. Other
    cache keyrings, such as idmapper keyrings, also need to set the
    domain tag - for which they need access to the network namespace of
    the superblock"

    * tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    keys: Pass the network namespace into request_key mechanism
    keys: Network namespace domain tag
    keys: Garbage collect keys for which the domain has been removed
    keys: Include target namespace in match criteria
    keys: Move the user and user-session keyrings to the user_namespace
    keys: Namespace keyring names
    keys: Add a 'recurse' flag for keyring searches
    keys: Cache the hash value to avoid lots of recalculation
    keys: Simplify key description management

    Linus Torvalds
     

27 Jun, 2019

2 commits

  • Move the user and user-session keyrings to the user_namespace struct rather
    than pinning them from the user_struct struct. This prevents these
    keyrings from propagating across user-namespaces boundaries with regard to
    the KEY_SPEC_* flags, thereby making them more useful in a containerised
    environment.

    The issue is that a single user_struct may represent UIDs in several
    different namespaces.

    The way the patch does this is by attaching a 'register keyring' in each
    user_namespace and then sticking the user and user-session keyrings into
    that. It can then be searched to retrieve them.

    Signed-off-by: David Howells
    cc: Jann Horn

    David Howells
     
  • Keyring names are held in a single global list that any process can pick
    from by means of keyctl_join_session_keyring (provided the keyring grants
    Search permission). This isn't very container friendly, however.

    Make the following changes:

    (1) Make default session, process and thread keyring names begin with a
    '.' instead of '_'.

    (2) Keyrings whose names begin with a '.' aren't added to the list. Such
    keyrings are system specials.

    (3) Replace the global list with per-user_namespace lists. A keyring adds
    its name to the list for the user_namespace that it is currently in.

    (4) When a user_namespace is deleted, it just removes itself from the
    keyring name list.

    The global keyring_name_lock is retained for accessing the name lists.
    This allows (4) to work.
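
    Rules (2) and (3) could be modelled roughly as below. This is a
    userspace sketch under stated assumptions, not the kernel's keyring
    code: the type sketch_name_list and the helpers
    sketch_register_name() and sketch_find_name() are hypothetical names,
    and a small fixed-size array stands in for the per-namespace list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_NAMES 16

/* Hypothetical stand-in for one user namespace's keyring name list. */
struct sketch_name_list {
    const char *names[SKETCH_MAX_NAMES];
    size_t count;
};

/* Rule (2): names beginning with '.' are system specials and stay unlisted. */
static bool sketch_register_name(struct sketch_name_list *ns, const char *name)
{
    assert(ns != NULL);
    if (name == NULL || name[0] == '\0' || name[0] == '.')
        return false;
    if (ns->count == SKETCH_MAX_NAMES)
        return false;
    ns->names[ns->count++] = name;
    return true;
}

/* Rule (3): a lookup only consults the list of the namespace it is given. */
static bool sketch_find_name(const struct sketch_name_list *ns, const char *name)
{
    assert(ns != NULL && name != NULL);
    for (size_t i = 0; i < ns->count; i++)
        if (strcmp(ns->names[i], name) == 0)
            return true;
    return false;
}
```

    With two separate lists, a name registered in one is not found
    through the other, which is the isolation the change is after.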

    This can be tested by:

    # keyctl newring foo @s
    995906392
    # unshare -U
    $ keyctl show
    ...
    995906392 --alswrv 65534 65534 \_ keyring: foo
    ...
    $ keyctl session foo
    Joined session keyring: 935622349

    As can be seen, a new session keyring was created.

    The capability bit KEYCTL_CAPS1_NS_KEYRING_NAME is set if the kernel is
    employing this feature.

    Signed-off-by: David Howells
    cc: Eric W. Biederman

    David Howells
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

1 commit

  • The out_unlock label is misleading; no unlocking happens after it, so
    just return NULL directly.

    Also, nothing between the kmem_cache_zalloc() that creates new and the
    two key_put() can initialize new->uid_keyring or new->session_keyring,
    so those calls are no-ops.
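
    The second point can be illustrated in userspace. This is only a
    sketch: sketch_key_put() and sketch_alloc_user() are hypothetical
    stand-ins, calloc() plays the role of a zeroing allocator, and the
    sketch assumes (as on common platforms) that all-zero bytes read back
    as NULL pointers.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_key {
    int usage;
};

struct sketch_user_entry {
    struct sketch_key *uid_keyring;
    struct sketch_key *session_keyring;
};

static int sketch_puts_that_did_work;  /* bookkeeping for the example only */

/* Hypothetical put helper that ignores NULL, as described for key_put(). */
static void sketch_key_put(struct sketch_key *key)
{
    if (key) {
        assert(key->usage > 0);
        key->usage--;
        sketch_puts_that_did_work++;
    }
}

/* Zeroed allocation: both keyring pointers start out NULL. */
static struct sketch_user_entry *sketch_alloc_user(void)
{
    return calloc(1, sizeof(struct sketch_user_entry));
}
```

    Calling sketch_key_put() on either field of a freshly allocated
    entry leaves the counter untouched, so such calls can be dropped.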

    Link: http://lkml.kernel.org/r/20190424200404.9114-1-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes
    Reviewed-by: Andrew Morton
    Cc: "Peter Zijlstra (Intel)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     

23 Aug, 2018

2 commits

  • The irqsave variant of refcount_dec_and_lock handles irqsave/restore when
    taking/releasing the spin lock. With this variant the call of
    local_irq_save/restore is no longer required.

    [bigeasy@linutronix.de: s@atomic_dec_and_lock@refcount_dec_and_lock@g]
    Link: http://lkml.kernel.org/r/20180703200141.28415-7-bigeasy@linutronix.de
    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Peter Zijlstra (Intel)
    Cc: "Eric W. Biederman"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anna-Maria Gleixner
     
  • refcount_t type and corresponding API should be used instead of atomic_t
    when the variable is used as a reference counter. This avoids accidental
    refcounter overflows that might lead to use-after-free situations.
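
    The overflow protection can be sketched in userspace C11. This is an
    illustration of the saturating idea only, not the kernel's refcount_t
    implementation; struct sketch_refcount and the sketch_ref_*() helpers
    are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_SATURATED UINT_MAX

/* Hypothetical userspace counter that saturates instead of wrapping. */
struct sketch_refcount {
    atomic_uint refs;
};

static void sketch_ref_init(struct sketch_refcount *r, unsigned int n)
{
    atomic_init(&r->refs, n);
}

static unsigned int sketch_ref_read(struct sketch_refcount *r)
{
    return atomic_load(&r->refs);
}

/* Increment, but stick at the saturation value rather than wrap to 0. */
static void sketch_ref_inc(struct sketch_refcount *r)
{
    unsigned int old = atomic_load(&r->refs);

    do {
        if (old == SKETCH_SATURATED)
            return;
    } while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));
}

/* Returns true only when the caller dropped the last reference. */
static bool sketch_ref_dec_and_test(struct sketch_refcount *r)
{
    unsigned int old = atomic_load(&r->refs);

    do {
        if (old == SKETCH_SATURATED)
            return false;   /* a saturated counter is never released */
        assert(old != 0);   /* decrementing a dead object is a bug */
    } while (!atomic_compare_exchange_weak(&r->refs, &old, old - 1));
    return old == 1;
}
```

    A saturated counter leaks the object instead of freeing it while
    references may remain, trading a leak for a use-after-free.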

    Link: http://lkml.kernel.org/r/20180703200141.28415-6-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Suggested-by: Peter Zijlstra
    Acked-by: Peter Zijlstra (Intel)
    Reviewed-by: Andrew Morton
    Cc: "Eric W. Biederman"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     

23 Feb, 2018

1 commit

  • Each read from a file in efivarfs results in two calls to EFI
    (one to get the file size, another to get the actual data).

    On X86 these EFI calls result in broadcast system management
    interrupts (SMI) which affect performance of the whole system.
    A malicious user can loop performing reads from efivarfs bringing
    the system to its knees.

    Linus suggested a per-user rate limit to solve this.

    So we add a ratelimit structure to "user_struct" and initialize
    it for the root user for no limit. When allocating user_struct for
    other users we set the limit to 100 per second. This could be used
    for other places that want to limit the rate of some detrimental
    user action.

    In efivarfs if the limit is exceeded when reading, we take an
    interruptible nap for 50ms and check the rate limit again.
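
    A per-user limit of this shape can be sketched as a fixed-window
    counter. This is a userspace illustration, not the kernel's ratelimit
    code: struct sketch_ratelimit and sketch_ratelimit_ok() are
    hypothetical, and the current time is passed in by the caller in
    milliseconds so that the behaviour is easy to follow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-window limiter attached to each user. */
struct sketch_ratelimit {
    uint64_t interval_ms;   /* window length; 0 means "no limit" */
    unsigned int burst;     /* events allowed per window */
    uint64_t window_start;  /* start of the current window, in ms */
    unsigned int used;      /* events counted in the current window */
};

/* Returns true if an event at time now_ms stays within the limit. */
static bool sketch_ratelimit_ok(struct sketch_ratelimit *rl, uint64_t now_ms)
{
    assert(rl != NULL);
    if (rl->interval_ms == 0)
        return true;                    /* e.g. the root user */
    if (now_ms - rl->window_start >= rl->interval_ms) {
        rl->window_start = now_ms;      /* open a fresh window */
        rl->used = 0;
    }
    if (rl->used >= rl->burst)
        return false;                   /* caller should nap and retry */
    rl->used++;
    return true;
}
```

    An ordinary user would be set up with interval_ms = 1000 and
    burst = 100; a reader that gets false back would sleep for 50ms and
    call again, as the efivarfs change describes.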

    Signed-off-by: Tony Luck
    Acked-by: Ard Biesheuvel
    Signed-off-by: Linus Torvalds

    Luck, Tony
     

01 Nov, 2017

1 commit

  • - Add a struct containing two pointers to extents and wrap both the static extent
    array and the struct into a union. This is done in preparation for bumping the
    {g,u}idmap limits for user namespaces.
    - Add brackets around anonymous union when using designated initializers to
    initialize members in order to please gcc
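
    The layout and the initializer style described above might look
    roughly like this in C11. All names (sketch_extent, sketch_map,
    forward, reverse, SKETCH_BASE_EXTENTS) and the array size are
    assumptions made for illustration; the helper at the end is likewise
    only a guess at how the two arms could be told apart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BASE_EXTENTS 5   /* assumed size of the inline array */

struct sketch_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

struct sketch_map {
    uint32_t nr_extents;
    union {
        /* small maps: extents stored inline */
        struct sketch_extent extent[SKETCH_BASE_EXTENTS];
        /* larger maps: two pointers to out-of-line extent arrays */
        struct {
            struct sketch_extent *forward;
            struct sketch_extent *reverse;
        };
    };
};

/* Note the extra braces around the anonymous union's initializer. */
static struct sketch_map sketch_init_map = {
    .nr_extents = 1,
    {
        .extent[0] = {
            .first = 0,
            .lower_first = 0,
            .count = 4294967295U,
        },
    },
};

/* Hypothetical helper: small maps are assumed to use the inline array. */
static int sketch_map_uses_inline(const struct sketch_map *map)
{
    assert(map != NULL);
    return map->nr_extents <= SKETCH_BASE_EXTENTS;
}
```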
    Signed-off-by: Eric W. Biederman

    Christian Brauner
     

02 Mar, 2017

1 commit


18 Dec, 2014

1 commit

  • Pull user namespace related fixes from Eric Biederman:
    "As these are bug fixes almost all of these changes are marked for
    backporting to stable.

    The first change (implicitly adding MNT_NODEV on remount) addresses a
    regression that was created when security issues with unprivileged
    remount were closed. I go on to update the remount test to make it
    easy to detect if this issue reoccurs.

    Then there are a handful of mount and umount related fixes.

    Then half of the changes deal with a recently discovered design
    bug in the permission checks of gid_map. Unix since the beginning has
    allowed setting group permissions on files to less than the user and
    other permissions (aka ---rwx---rwx). As the unix permission checks
    stop as soon as a group matches, and setgroups allows setting groups
    that can not later be dropped, this results in a situation where it is
    possible to legitimately use a group to assign fewer privileges to a
    process, which means dropping a group can increase a process's
    privileges.

    The fix I have adopted is that gid_map is now no longer writable
    without privilege unless the new file /proc/self/setgroups has been
    set to permanently disable setgroups.

    The bulk of user namespace using applications, even the applications
    using user namespaces without privilege, remain unaffected by this
    change. Unfortunately this fix breaks a couple of user space
    applications that were relying on the problematic behavior (one
    of which was tools/selftests/mount/unprivileged-remount-test.c).

    To hopefully prevent needing a regression fix on top of my security
    fix I rounded up folks who work with the container implementations
    most likely to be affected and encouraged them to test the changes.

    > So far nothing broke on my libvirt-lxc test bed. :-)
    > Tested with openSUSE 13.2 and libvirt 1.2.9.
    > Tested-by: Richard Weinberger

    > Tested on Fedora20 with libvirt 1.2.11, works fine.
    > Tested-by: Chen Hanxiao

    > Ok, thanks - yes, unprivileged lxc is working fine with your kernels.
    > Just to be sure I was testing the right thing I also tested using
    > my unprivileged nsexec testcases, and they failed on setgroup/setgid
    > as now expected, and succeeded there without your patches.
    > Tested-by: Serge Hallyn

    > I tested this with Sandstorm. It breaks as is and it works if I add
    > the setgroups thing.
    > Tested-by: Andy Lutomirski # breaks things as designed :("

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Unbreak the unprivileged remount tests
    userns; Correct the comment in map_write
    userns: Allow setting gid_maps without privilege when setgroups is disabled
    userns: Add a knob to disable setgroups on a per user namespace basis
    userns: Rename id_map_mutex to userns_state_mutex
    userns: Only allow the creator of the userns unprivileged mappings
    userns: Check euid no fsuid when establishing an unprivileged uid mapping
    userns: Don't allow unprivileged creation of gid mappings
    userns: Don't allow setgroups until a gid mapping has been setablished
    userns: Document what the invariant required for safe unprivileged mappings.
    groups: Consolidate the setgroups permission checks
    mnt: Clear mnt_expire during pivot_root
    mnt: Carefully set CL_UNPRIVILEGED in clone_mnt
    mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers.
    umount: Do not allow unmounting rootfs.
    umount: Disallow unprivileged mount force
    mnt: Update unprivileged remount test
    mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount

    Linus Torvalds
     

12 Dec, 2014

1 commit

  • - Expose the knob to user space through a proc file /proc/<pid>/setgroups

    A value of "deny" means the setgroups system call is disabled in the
    current process's user namespace and can not be enabled in the
    future in this user namespace.

    A value of "allow" means the setgroups system call is enabled.

    - Descendant user namespaces inherit the value of setgroups from
    their parents.

    - A proc file is used (instead of a sysctl) as sysctls currently do
    not allow checking the permissions at open time.

    - Writing to the proc file is restricted to before the gid_map
    for the user namespace is set.

    This ensures that disabling setgroups at a user namespace
    level will never remove the ability to call setgroups
    from a process that already has that ability.

    A process may opt in to the setgroups disable for itself by
    creating, entering and configuring a user namespace or by calling
    setns on an existing user namespace with setgroups disabled.
    Processes without privileges already can not call setgroups so this
    is a noop. Processes with privilege become processes without
    privilege when entering a user namespace and as with any other path
    to dropping privilege they would not have the ability to call
    setgroups. So this remains within the bounds of what is possible
    without a knob to disable setgroups permanently in a user namespace.
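
    The rules for the knob can be captured in a small state model. This
    is a userspace sketch of the behaviour described in this message, not
    the kernel's implementation; struct sketch_userns and both helpers
    are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one user namespace's setgroups state. */
struct sketch_userns {
    bool setgroups_allowed;
    bool gid_map_written;
};

/* A descendant namespace inherits the setgroups value of its parent. */
static struct sketch_userns sketch_userns_create(const struct sketch_userns *parent)
{
    struct sketch_userns ns = { .setgroups_allowed = true,
                                .gid_map_written = false };

    if (parent)
        ns.setgroups_allowed = parent->setgroups_allowed;
    return ns;
}

/* Models a write to the setgroups file: 0 on success, -1 on refusal. */
static int sketch_write_setgroups(struct sketch_userns *ns, const char *value)
{
    assert(ns != NULL && value != NULL);
    if (strcmp(value, "allow") == 0)
        return ns->setgroups_allowed ? 0 : -1;  /* "deny" is permanent */
    if (strcmp(value, "deny") != 0)
        return -1;                              /* unknown value */
    if (ns->gid_map_written)
        return -1;                              /* too late to disable */
    ns->setgroups_allowed = false;
    return 0;
}
```

    The ordering check is what guarantees that a process which already
    has the ability to call setgroups never loses it through the knob.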

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

05 Dec, 2014

2 commits


05 Jun, 2014

1 commit


04 Apr, 2014

1 commit

  • Code that is obj-y (always built-in) or dependent on a bool Kconfig
    (built-in or absent) can never be modular. So using module_init as an
    alias for __initcall can be somewhat misleading.

    Fix these up now, so that we can relocate module_init from init.h into
    module.h in the future. If we don't do this, we'd have to add module.h
    to obviously non-modular code, and that would be a worse thing.

    The audit targets the following module_init users for change:
    kernel/user.c obj-y
    kernel/kexec.c bool KEXEC (one instance per arch)
    kernel/profile.c bool PROFILING
    kernel/hung_task.c bool DETECT_HUNG_TASK
    kernel/sched/stats.c bool SCHEDSTATS
    kernel/user_namespace.c bool USER_NS

    Note that direct use of __initcall is discouraged, vs. one of the
    priority categorized subgroups. As __initcall gets mapped onto
    device_initcall, our use of subsys_initcall (which makes sense for these
    files) will thus change this registration from level 6-device to level
    4-subsys (i.e. slightly earlier). However no impact of that difference
    has been observed during testing.

    Also, two instances of missing ";" at EOL are fixed in kexec.

    Signed-off-by: Paul Gortmaker
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Eric Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     

13 Dec, 2013

1 commit

  • We run into this bug:
    [ 2736.063245] Unable to handle kernel paging request for data at address 0x00000000
    [ 2736.063293] Faulting instruction address: 0xc00000000037efb0
    [ 2736.063300] Oops: Kernel access of bad area, sig: 11 [#1]
    [ 2736.063303] SMP NR_CPUS=2048 NUMA pSeries
    [ 2736.063310] Modules linked in: sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6table_security ip6table_raw ip6t_REJECT iptable_nat nf_nat_ipv4 iptable_mangle iptable_security iptable_raw ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ebtable_filter ebtables ip6table_filter iptable_filter ip_tables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 nf_nat nf_conntrack ip6_tables ibmveth pseries_rng nx_crypto nfsd auth_rpcgss nfs_acl lockd sunrpc binfmt_misc xfs libcrc32c dm_service_time sd_mod crc_t10dif crct10dif_common ibmvfc scsi_transport_fc scsi_tgt dm_mirror dm_region_hash dm_log dm_multipath dm_mod
    [ 2736.063383] CPU: 1 PID: 7128 Comm: ssh Not tainted 3.10.0-48.el7.ppc64 #1
    [ 2736.063389] task: c000000131930120 ti: c0000001319a0000 task.ti: c0000001319a0000
    [ 2736.063394] NIP: c00000000037efb0 LR: c0000000006c40f8 CTR: 0000000000000000
    [ 2736.063399] REGS: c0000001319a3870 TRAP: 0300 Not tainted (3.10.0-48.el7.ppc64)
    [ 2736.063403] MSR: 8000000000009032 CR: 28824242 XER: 20000000
    [ 2736.063415] SOFTE: 0
    [ 2736.063418] CFAR: c00000000000908c
    [ 2736.063421] DAR: 0000000000000000, DSISR: 40000000
    [ 2736.063425]
    GPR00: c0000000006c40f8 c0000001319a3af0 c000000001074788 c0000001319a3bf0
    GPR04: 0000000000000000 0000000000000000 0000000000000020 000000000000000a
    GPR08: fffffffe00000002 00000000ffff0000 0000000080000001 c000000000924888
    GPR12: 0000000028824248 c000000007e00400 00001fffffa0f998 0000000000000000
    GPR16: 0000000000000022 00001fffffa0f998 0000010022e92470 0000000000000000
    GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    GPR24: 0000000000000000 c000000000f4a828 00003ffffe527108 0000000000000000
    GPR28: c000000000f4a730 c000000000f4a828 0000000000000000 c0000001319a3bf0
    [ 2736.063498] NIP [c00000000037efb0] .__list_add+0x30/0x110
    [ 2736.063504] LR [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
    [ 2736.063508] PACATMSCRATCH [800000000280f032]
    [ 2736.063511] Call Trace:
    [ 2736.063516] [c0000001319a3af0] [c0000001319a3b80] 0xc0000001319a3b80 (unreliable)
    [ 2736.063523] [c0000001319a3b80] [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
    [ 2736.063530] [c0000001319a3c50] [c0000000006c1bb0] .down_write+0x70/0x78
    [ 2736.063536] [c0000001319a3cd0] [c0000000002e5ffc] .keyctl_get_persistent+0x20c/0x320
    [ 2736.063542] [c0000001319a3dc0] [c0000000002e2388] .SyS_keyctl+0x238/0x260
    [ 2736.063548] [c0000001319a3e30] [c000000000009e7c] syscall_exit+0x0/0x7c
    [ 2736.063553] Instruction dump:
    [ 2736.063556] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7cbd2b78 7c9e2378 7c7f1b78 f8010010
    [ 2736.063566] f821ff71 e8a50008 7fa52040 40de00c0 7fbd2840 40de0094 7fbff040
    [ 2736.063579] ---[ end trace 2708241785538296 ]---

    It's caused by uninitialized persistent_keyring_register_sem.

    The bug was introduced by commit f36f8c75, two typos are in that commit:
    CONFIG_KEYS_KERBEROS_CACHE should be CONFIG_PERSISTENT_KEYRINGS and
    krb_cache_register_sem should be persistent_keyring_register_sem.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: David Howells

    Xiao Guangrong
     

24 Sep, 2013

1 commit

  • Add support for per-user_namespace registers of persistent per-UID kerberos
    caches held within the kernel.

    This allows the kerberos cache to be retained beyond the life of all a user's
    processes so that the user's cron jobs can work.

    The kerberos cache is envisioned as a keyring/key tree looking something like:

    struct user_namespace
    \___ .krb_cache keyring - The register
    \___ _krb.0 keyring - Root's Kerberos cache
    \___ _krb.5000 keyring - User 5000's Kerberos cache
    \___ _krb.5001 keyring - User 5001's Kerberos cache
    \___ tkt785 big_key - A ccache blob
    \___ tkt12345 big_key - Another ccache blob

    Or possibly:

    struct user_namespace
    \___ .krb_cache keyring - The register
    \___ _krb.0 keyring - Root's Kerberos cache
    \___ _krb.5000 keyring - User 5000's Kerberos cache
    \___ _krb.5001 keyring - User 5001's Kerberos cache
    \___ tkt785 keyring - A ccache
    \___ krbtgt/REDHAT.COM@REDHAT.COM big_key
    \___ http/REDHAT.COM@REDHAT.COM user
    \___ afs/REDHAT.COM@REDHAT.COM user
    \___ nfs/REDHAT.COM@REDHAT.COM user
    \___ krbtgt/KERNEL.ORG@KERNEL.ORG big_key
    \___ http/KERNEL.ORG@KERNEL.ORG big_key

    What goes into a particular Kerberos cache is entirely up to userspace. Kernel
    support is limited to giving you the Kerberos cache keyring that you want.

    The user asks for their Kerberos cache by:

    krb_cache = keyctl_get_krbcache(uid, dest_keyring);

    The uid is -1 or the user's own UID for the user's own cache or the uid of some
    other user's cache (requires CAP_SETUID). This permits rpc.gssd or whatever to
    mess with the cache.

    The cache returned is a keyring named "_krb.<uid>" that the possessor can read,
    search, clear, invalidate, unlink from and add links to. Active LSMs get a
    chance to rule on whether the caller is permitted to make a link.

    Each uid's cache keyring is created when it is first accessed and is given a
    timeout that is extended each time this function is called so that the keyring
    goes away after a while. The timeout is configurable by sysctl but defaults to
    three days.

    Each user_namespace struct gets a lazily-created keyring that serves as the
    register. The cache keyrings are added to it. This means that standard key
    search and garbage collection facilities are available.

    The user_namespace struct's register goes away when it does and anything left
    in it is then automatically gc'd.

    Signed-off-by: David Howells
    Tested-by: Simo Sorce
    cc: Serge E. Hallyn
    cc: Eric W. Biederman

    David Howells
     

27 Aug, 2013

1 commit

  • Rely on the fact that another flavor of the filesystem is already
    mounted and do not rely on state in the user namespace.

    Verify that the mounted filesystem is not covered in any significant
    way. I would love to verify that the previously mounted filesystem
    has no mounts on top but there are at least the directories
    /proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
    for other filesystems to mount on top of.

    Refactor the test into a function named fs_fully_visible and call that
    function from the mount routines of proc and sysfs. This makes this
    test local to the filesystems involved and the results current of when
    the mounts take place, removing a weird threading of the user
    namespace, the mount namespace and the filesystems themselves.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

02 May, 2013

2 commits

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Split the proc namespace stuff out into linux/proc_ns.h.

    Signed-off-by: David Howells
    cc: netdev@vger.kernel.org
    cc: Serge E. Hallyn
    cc: Eric W. Biederman
    Signed-off-by: Al Viro

    David Howells
     

27 Mar, 2013

1 commit

  • Only allow unprivileged mounts of proc and sysfs if they are already
    mounted when the user namespace is created.

    proc and sysfs are interesting because they have content that is
    per namespace, and so fresh mounts are needed when new namespaces
    are created while at the same time proc and sysfs have content that
    is shared between every instance.

    Respect the policy of who may see the shared content of proc and sysfs
    by only allowing new mounts if there was an existing mount at the time
    the user namespace was created.

    In practice there are only two interesting cases: proc and sysfs are
    mounted at their usual places, proc and sysfs are not mounted at all
    (some form of mount namespace jail).

    Cc: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
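
    A three-argument iterator of this kind can be sketched in userspace.
    This is not the macro from linux/list.h: the sketch_* names are
    hypothetical, the node is a simplified singly linked one, and the
    macro assumes the GCC/Clang __typeof__ extension is available.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified singly linked node, loosely modelled on an hlist node. */
struct sketch_hnode {
    struct sketch_hnode *next;
};

struct sketch_hhead {
    struct sketch_hnode *first;
};

/* Map a node back to its containing entry, passing NULL through. */
static void *sketch_entry_or_null(struct sketch_hnode *node, size_t offset)
{
    return node ? (void *)((char *)node - offset) : NULL;
}

/* The entry pointer is the only cursor; no separate node cursor needed. */
#define sketch_hlist_for_each_entry(pos, head, member)                     \
    for (pos = sketch_entry_or_null((head)->first,                         \
                                    offsetof(__typeof__(*(pos)), member)); \
         pos;                                                              \
         pos = sketch_entry_or_null((pos)->member.next,                    \
                                    offsetof(__typeof__(*(pos)), member)))

struct sketch_user {
    unsigned int uid;
    struct sketch_hnode node;
};

/* Push an entry at the front of the list. */
static void sketch_hlist_add_head(struct sketch_hnode *n, struct sketch_hhead *h)
{
    n->next = h->first;
    h->first = n;
}

/* Look up a user by uid with the three-argument iterator. */
static struct sketch_user *sketch_find_user(struct sketch_hhead *head,
                                            unsigned int uid)
{
    struct sketch_user *u;

    assert(head != NULL);
    sketch_hlist_for_each_entry(u, head, node)
        if (u->uid == uid)
            return u;
    return NULL;
}
```

    The call site now reads the same as a list_for_each_entry() loop:
    one cursor, the head, and the member name.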

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

27 Jan, 2013

1 commit

  • When freeing a deeply nested user namespace free_user_ns calls
    put_user_ns on its parent which may in turn call free_user_ns again.
    When -fno-optimize-sibling-calls is passed to gcc one stack frame per
    user namespace is left on the stack, potentially overflowing the
    kernel stack. CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
    so we can't count on gcc to optimize this code.

    Remove struct kref and use a plain atomic_t. Making the code more
    flexible and easier to comprehend. Make the loop in free_user_ns
    explicit to guarantee that the stack does not overflow with
    CONFIG_FRAME_POINTER enabled.

    I have tested this fix with a simple program that uses unshare to
    create a deeply nested user namespace structure and then calls exit.
    With 1000 nested user namespaces before this change running my test
    program causes the kernel to die a horrible death. With 10,000,000
    nested user namespaces after this change my test program runs to
    completion and causes no harm.
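
    The shape of the fix, an explicit loop up the parent chain instead of
    mutual recursion, can be sketched in userspace. This is an
    illustration only: struct sketch_ns and its helpers are hypothetical,
    and a plain int stands in for the atomic counter, so the sketch is
    single-threaded.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical namespace with a parent pointer and a plain counter. */
struct sketch_ns {
    struct sketch_ns *parent;
    int count;
};

static int sketch_freed;  /* bookkeeping for the example only */

/* Create a namespace that holds one reference on its parent, if any. */
static struct sketch_ns *sketch_ns_create(struct sketch_ns *parent)
{
    struct sketch_ns *ns = malloc(sizeof(*ns));

    if (!ns)
        return NULL;
    ns->parent = parent;
    ns->count = 1;
    if (parent)
        parent->count++;
    return ns;
}

/* Drop a reference; walk up the chain in a loop rather than recursing. */
static void sketch_ns_put(struct sketch_ns *ns)
{
    while (ns) {
        struct sketch_ns *parent;

        assert(ns->count > 0);
        if (--ns->count != 0)
            break;              /* someone else still holds it */
        parent = ns->parent;
        free(ns);
        sketch_freed++;
        ns = parent;            /* now drop the reference we held on it */
    }
}
```

    Stack usage stays constant however deep the nesting is, because each
    step reuses the same frame.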

    Acked-by: Serge Hallyn
    Pointed-out-by: Vasily Kulikov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

20 Nov, 2012

1 commit

  • Assign a unique proc inode to each namespace, and use that
    inode number to ensure we only allocate at most one proc
    inode for every namespace in proc.

    A single proc inode per namespace allows userspace to test
    to see if two processes are in the same namespace.
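
    From userspace such a test could be a comparison of device and inode
    numbers. The helper below is a hypothetical sketch using POSIX
    stat(); paths such as /proc/self/ns/user are an assumption about
    where the namespace files are exposed on a given system.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Returns 1 if both paths resolve to the same inode on the same device,
 * 0 if they differ, and -1 if either path cannot be examined.
 */
static int sketch_same_inode(const char *path_a, const char *path_b)
{
    struct stat a, b;

    assert(path_a != NULL && path_b != NULL);
    if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
        return -1;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

    For two processes one would pass each process's namespace file, for
    example the "ns/user" entry under each /proc/<pid> directory.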

    This has been a long requested feature and only blocked because
    a naive implementation would put the id in a global space and
    would ultimately require having a namespace for the names of
    namespaces, making migration and certain virtualization tricks
    impossible.

    We still don't have per superblock inode numbers for proc, which
    appears necessary for application unaware checkpoint/restart and
    migrations (if the application is using namespace file descriptors)
    but that is now allowed by the design if it becomes important.

    I have preallocated the ipc and uts initial proc inode numbers so
    their structures can be statically initialized.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

18 Sep, 2012

1 commit

  • Implement kprojid_t a cousin of the kuid_t and kgid_t.

    The per user namespace mapping of project id values can be set with
    /proc/<pid>/projid_map.

    A full complement of helpers is provided: make_kprojid, from_kprojid,
    from_kprojid_munged, kprojid_has_mapping, projid_valid, projid_eq,
    projid_lt.

    Project identifiers are part of the generic disk quota interface,
    although it appears only xfs implements project identifiers currently.

    The xfs code allows anyone who has permission to set the project
    identifier on a file to use any project identifier so when
    setting up the user namespace project identifier mappings I do
    not require a capability.

    Cc: Dave Chinner
    Cc: Jan Kara
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

20 May, 2012

1 commit

  • On 32bit builds gcc says:
    kernel/user.c:30:4: warning: this decimal constant is unsigned only in ISO C90 [enabled by default]
    kernel/user.c:38:4: warning: this decimal constant is unsigned only in ISO C90 [enabled by default]

    Silence gcc by changing the constant 4294967295 to 4294967295U.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

26 Apr, 2012

2 commits

  • - Convert the old uid mapping functions into compatibility wrappers
    - Add a uid/gid mapping layer from user space uid and gids to kernel
    internal uids and gids that is extent based for simplicity and speed.
    * Working with number space after mapping uids/gids into their kernel
    internal version adds only mapping complexity over what we have today,
    leaving the kernel code easy to understand and test.
    - Add proc files /proc/self/uid_map /proc/self/gid_map
    These files display the mapping and allow a mapping to be added
    if a mapping does not exist.
    - Allow entering the user namespace without a uid or gid mapping.
    Since we are starting with an existing user our uids and gids
    still have global mappings, so they are still valid and useful; they
    just don't have local mappings. The requirement for things to work is
    a global uid and gid, so it is odd but perfectly fine not to have a
    local uid and gid mapping.
    Not requiring local uid and gid mappings greatly simplifies
    the logic of setting up the uid and gid mappings by allowing
    the mappings to be set after the namespace is created, which makes the
    slight weirdness worth it.
    - Make the mappings in the initial user namespace to the global
    uid/gid space explicit. Today it is an identity mapping
    but in the future we may want to twist this for debugging, similar
    to what we do with jiffies.
    - Document the memory ordering requirements of setting the uid and
    gid mappings. We only allow the mappings to be set once
    and there are no pointers involved so the requirements are
    trivial but a little atypical.
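
The extent-based mapping described in the list above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel code: the demo_* names, the bound of five extents, and the use of the all-ones value for "unmapped" are all choices made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_EXTENTS 5                  /* arbitrary bound for the sketch */
#define DEMO_INVALID_ID  ((uint32_t)-1)     /* assumed "no mapping" marker */

/* One extent maps [first, first + count) inside the namespace onto
 * [lower_first, lower_first + count) in the kernel-internal space. */
struct demo_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

struct demo_id_map {
	size_t nr_extents;
	struct demo_extent extent[DEMO_MAX_EXTENTS];
};

/* Namespace-local id -> kernel-internal id. */
static uint32_t demo_map_id_down(const struct demo_id_map *map, uint32_t id)
{
	assert(map != NULL && map->nr_extents <= DEMO_MAX_EXTENTS);
	for (size_t i = 0; i < map->nr_extents; i++) {
		const struct demo_extent *e = &map->extent[i];

		/* The subtraction form avoids overflow in first + count. */
		if (id >= e->first && id - e->first < e->count)
			return id - e->first + e->lower_first;
	}
	return DEMO_INVALID_ID;
}

/* Kernel-internal id -> namespace-local id. */
static uint32_t demo_map_id_up(const struct demo_id_map *map, uint32_t kid)
{
	assert(map != NULL && map->nr_extents <= DEMO_MAX_EXTENTS);
	for (size_t i = 0; i < map->nr_extents; i++) {
		const struct demo_extent *e = &map->extent[i];

		if (kid >= e->lower_first && kid - e->lower_first < e->count)
			return kid - e->lower_first + e->first;
	}
	return DEMO_INVALID_ID;
}
```

An empty map (nr_extents of zero) maps nothing in either direction, which corresponds to entering a namespace before any mapping has been written.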

    Performance:

    In this scheme for the permission checks the performance is expected to
    stay the same as the actual machine instructions should remain the same.

    The worst case I could think of is ls -l on a large directory where
    all of the stat results need to be translated from kuids and
    kgids to uids and gids. So I benchmarked that case on my laptop
    with a dual core hyperthread Intel i5-2520M cpu with 3M of cpu cache.

    My benchmark consisted of going to single user mode where nothing else
    was running. On an ext4 filesystem I opened 1,000,000 files, then looped
    through all of the files 1000 times, calling fstat on each individual
    file. This was to ensure I was benchmarking stat times where the inodes
    were in the kernel's cache, but the inode values were not in the
    processor's cache. My results:

    v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
    v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
    v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)

    All of the configurations ran in roughly 120ns when I performed tests
    that ran in the cpu cache.

    So in summary the performance impact is:
    1ns improvement in the worst case with user namespace support compiled out.
    8ns aka 5% slowdown in the worst case with user namespace support compiled in.
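
A measurement loop of the kind described above might look like the following sketch. It is not the benchmark program that produced the numbers in this message; the function name is hypothetical, and the number of descriptors and passes are left as parameters.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

/* Average wall-clock cost, in nanoseconds, of one fstat() call when
 * sweeping `passes` times over `nfds` already-open descriptors.
 * Returns a negative value if timing or fstat() fails. */
static double demo_avg_fstat_ns(const int *fds, size_t nfds,
				unsigned int passes)
{
	struct timespec start, end;
	struct stat st;

	assert(fds != NULL);
	if (nfds == 0 || passes == 0)
		return -1.0;
	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return -1.0;

	for (unsigned int p = 0; p < passes; p++) {
		for (size_t i = 0; i < nfds; i++) {
			if (fstat(fds[i], &st) != 0)
				return -1.0;
		}
	}

	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
		return -1.0;

	double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9 +
			    (double)(end.tv_nsec - start.tv_nsec);
	return elapsed_ns / ((double)nfds * (double)passes);
}
```

Sweeping over a large descriptor array approximates the cache-cold case, while calling it with a handful of descriptors approximates the in-cache case.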

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • - Transform userns->creator from a user_struct reference to a simple
    kuid_t, kgid_t pair.

    In cap_capable this allows the check to see if we are the creator of
    a namespace to become the classic suser style euid permission check.

    This allows us to remove the need for a struct cred in the mapping
    functions and still be able to display the user namespace creator's
    uid and gid as 0.

    - Remove the now unnecessary delayed_work in free_user_ns.

    All that is left for free_user_ns to do is to call kmem_cache_free
    and put_user_ns. Those functions can be called in any context
    so call them directly from free_user_ns removing the need for delayed work.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

08 Apr, 2012

2 commits

  • Modify alloc_uid to take a kuid and make the user hash table global.
    Stop holding a reference to the user namespace in struct user_struct.

    This simplifies the code and makes the per user accounting not
    care about which user namespace a uid happens to appear in.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • With a user_ns reference in struct cred the only user of the user namespace
    reference in struct user_struct is to keep the uid hash table alive.

    The user_namespace reference in struct user_struct will be going away soon, and
    I have removed all of the references. Rename the field from user_ns to _user_ns
    so that the compiler can verify nothing follows the user struct to the user
    namespace anymore.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

31 Oct, 2011

1 commit

  • The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else. Revector them
    onto the isolated export header for faster compile times.

    Nothing to see here but a whole lot of instances of:

    -#include
    +#include

    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

24 Mar, 2011

1 commit

  • The expected course of development for user namespaces targeted
    capabilities is laid out at https://wiki.ubuntu.com/UserNamespace.

    Goals:

    - Make it safe for an unprivileged user to unshare namespaces. They
    will be privileged with respect to the new namespace, but this should
    only include resources which the unprivileged user already owns.

    - Provide separate limits and accounting for userids in different
    namespaces.

    Status:

    Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to
    get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and
    CAP_SETGID capabilities. What this gets you is a whole new set of
    userids, meaning that user 500 will have a different 'struct user' in
    your namespace than in other namespaces. So any accounting information
    stored in struct user will be unique to your namespace.

    However, throughout the kernel there are checks which

    - simply check for a capability. Since root in a child namespace
    has all capabilities, this means that a child namespace is not
    constrained.

    - simply compare uid1 == uid2. Since these are the integer uids,
    uid 500 in namespace 1 will be said to be equal to uid 500 in
    namespace 2.

    As a result, the lxc implementation at lxc.sf.net does not use user
    namespaces. This is actually helpful because it leaves us free to
    develop user namespaces in such a way that, for some time, user
    namespaces may be unuseful.

    Bugs aside, this patchset is supposed to not at all affect systems which
    are not actively using user namespaces, and only restrict what tasks in
    child user namespace can do. They begin to limit privilege to a user
    namespace, so that root in a container cannot kill or ptrace tasks in the
    parent user namespace, and can only get world access rights to files.
    Since all files currently belong to the initial user namespace, that means
    that child user namespaces can only get world access rights to *all*
    files. While this temporarily makes user namespaces bad for system
    containers, it starts to get useful for some sandboxing.

    I've run the 'runltplite.sh' with and without this patchset and found no
    difference.

    This patch:

    copy_process() handles CLONE_NEWUSER before the rest of the namespaces.
    So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace
    will have the new user namespace as its owner. That is what we want,
    since we want root in that new userns to be able to have privilege over
    it.

    Changelog:
    Feb 15: don't set uts_ns->user_ns if we didn't create
    a new uts_ns.
    Feb 23: Move extern init_user_ns declaration from
    init/version.c to utsname.h.
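
The ordering described under "This patch" can be modelled with a small userspace sketch. Every name here is hypothetical, including the flag values, and the caller supplies storage for new namespaces so that no allocator is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values for the sketch, not the real CLONE_* bits. */
#define DEMO_CLONE_NEWUSER 0x1u
#define DEMO_CLONE_NEWUTS  0x2u

struct demo_user_ns { int id; };
struct demo_uts_ns  { struct demo_user_ns *user_ns; };  /* owner */

struct demo_task {
	struct demo_user_ns *user_ns;
	struct demo_uts_ns *uts_ns;
};

/* Set up the child's namespaces. The user namespace is handled first so
 * that a uts namespace created in the same call is owned by it. */
static void demo_copy_namespaces(struct demo_task *child,
				 const struct demo_task *parent,
				 unsigned int flags,
				 struct demo_user_ns *new_user,
				 struct demo_uts_ns *new_uts)
{
	assert(child != NULL && parent != NULL);
	*child = *parent;

	if (flags & DEMO_CLONE_NEWUSER) {
		assert(new_user != NULL);
		child->user_ns = new_user;
	}

	if (flags & DEMO_CLONE_NEWUTS) {
		assert(new_uts != NULL);
		/* Owner is whichever user namespace the child ended up in. */
		new_uts->user_ns = child->user_ns;
		child->uts_ns = new_uts;
	}
	/* Without DEMO_CLONE_NEWUTS the shared uts namespace keeps its
	 * existing owner, matching the Feb 15 changelog entry. */
}
```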

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

30 Dec, 2010

1 commit

  • When racing to add an entry to the user cache, the newly allocated
    entry is freed back to the slab without putting the user namespace.

    Since a reference to the user namespace has already been taken, it
    has to be put as well.
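
The leak pattern can be sketched with plain reference counters. This is a simplified single-slot model with hypothetical names, not the kernel's hash table code; it only shows why the losing side of the race has to drop the namespace reference it took.

```c
#include <assert.h>
#include <stddef.h>

struct demo_ns {
	int refcount;
};

struct demo_user {
	unsigned int uid;
	struct demo_ns *ns;   /* holds one reference on the namespace */
	int refcount;
};

static struct demo_ns *demo_get_ns(struct demo_ns *ns)
{
	ns->refcount++;
	return ns;
}

static void demo_put_ns(struct demo_ns *ns)
{
	assert(ns->refcount > 0);
	ns->refcount--;
}

/* Publish `fresh` in `slot`, or reuse the entry that got there first.
 * `fresh` already holds a namespace reference taken when it was set up. */
static struct demo_user *demo_insert_or_reuse(struct demo_user **slot,
					      struct demo_user *fresh)
{
	assert(slot != NULL && fresh != NULL && fresh->ns != NULL);

	if (*slot != NULL) {
		/* Lost the race: drop the namespace reference held by the
		 * discarded entry, otherwise the count never reaches zero. */
		demo_put_ns(fresh->ns);
		fresh->ns = NULL;
		(*slot)->refcount++;
		return *slot;
	}

	*slot = fresh;
	return fresh;
}
```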

    Signed-off-by: Hillf Danton
    Acked-by: Serge Hallyn
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

27 Oct, 2010

1 commit

  • free_user() releases uidhash_lock but was missing annotation. Add it.
    This removes following sparse warnings:

    include/linux/spinlock.h:339:9: warning: context imbalance in 'free_user' - unexpected unlock
    kernel/user.c:120:6: warning: context imbalance in 'free_uid' - wrong count at exit
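
A sketch of how such an annotation is typically wired up, assuming sparse's context attribute and its __CHECKER__ define. The DEMO_RELEASES macro, the toy lock and the function below are hypothetical userspace stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* sparse defines __CHECKER__; an ordinary compiler sees an empty macro,
 * so the annotation costs nothing in a normal build. */
#ifdef __CHECKER__
# define DEMO_RELEASES(x) __attribute__((context(x, 1, 0)))
#else
# define DEMO_RELEASES(x)
#endif

/* Toy lock used only to make the sketch self-contained. */
struct demo_lock {
	bool held;
};

static struct demo_lock demo_hash_lock;

static void demo_lock_acquire(struct demo_lock *l)
{
	l->held = true;
}

static void demo_lock_release(struct demo_lock *l)
{
	l->held = false;
}

/* Entered with demo_hash_lock held, returns with it released. The
 * annotation records that asymmetry so the checker does not report an
 * unexpected unlock. */
static void demo_free_entry(int *entry_count)
	DEMO_RELEASES(demo_hash_lock)
{
	assert(demo_hash_lock.held);
	(*entry_count)--;
	demo_lock_release(&demo_hash_lock);
}
```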

    Signed-off-by: Namhyung Kim
    Cc: Ingo Molnar
    Cc: Dhaval Giani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     

10 May, 2010

1 commit

  • This comment should have been removed together with uids_mutex
    when removing user sched.

    Signed-off-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Dhaval Giani
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     

03 Apr, 2010

1 commit


16 Mar, 2010

1 commit


21 Jan, 2010

1 commit

  • Remove the USER_SCHED feature. It has been scheduled to be removed in
    2.6.34 as per http://marc.info/?l=linux-kernel&m=125728479022976&w=2

    Signed-off-by: Dhaval Giani
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Dhaval Giani