Eric Lee / smarc-fsl-linux-kernel

23 Sep, 2016

2 commits

787255966 Merge branch 'nsfs-ioctls' into HEAD

From: Andrey Vagin

Each namespace has an owning user namespace, and there is currently no
way to discover these relationships.

Pid and user namespaces are hierarchical. There is no way to discover
parent-child relationships either.

Why might we want to know the relationships between namespaces?

One use would be visualization, in order to understand the running
system. Another would be to answer the question: what capability does
process X have to perform operations on a resource governed by namespace
Y?

One more use-case (which is usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.

There was a discussion [1] about which interface to choose to determine
relationships between namespaces.

Eric suggested adding two ioctl-s [2]:
> Grumble, Grumble. I think this may actually a case for creating ioctls
> for these two cases. Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.

Here is an implementation of these ioctl-s.

$ man man7/namespaces.7
...
Since Linux 4.X, the following ioctl(2) calls are supported for
namespace file descriptors. The correct syntax is:

      fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
      Returns a file descriptor that refers to an owning user
      namespace.

NS_GET_PARENT
      Returns a file descriptor that refers to a parent namespace.
      This ioctl(2) can be used for pid and user namespaces. For
      user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
      meaning.

In addition to generic ioctl(2) errors, the following specific ones
can occur:

EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM  The requested namespace is outside of the current namespace
       scope.
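
As an illustration of the interface described in the man page text, here is a minimal userspace sketch. It assumes the kernel headers ship <linux/nsfs.h> with the NS_GET_USERNS and NS_GET_PARENT request codes; the helper name ns_related_fd() and its negative-errno return convention are inventions of this sketch, not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nsfs.h>   /* NS_GET_USERNS, NS_GET_PARENT */

/*
 * Hypothetical helper: open a namespace file such as /proc/self/ns/pid
 * and ask the kernel for a related namespace.  Returns a new file
 * descriptor on success or a negative errno value on failure.
 */
static int ns_related_fd(const char *ns_path, unsigned long request)
{
	int ns_fd, fd, err;

	assert(request == NS_GET_USERNS || request == NS_GET_PARENT);

	ns_fd = open(ns_path, O_RDONLY);
	if (ns_fd < 0)
		return -errno;

	fd = ioctl(ns_fd, request);   /* fd = ioctl(ns_fd, ioctl_type) */
	err = errno;                  /* save before close() can change it */
	close(ns_fd);

	return fd < 0 ? -err : fd;
}
```

Going by the error list above, a caller should expect -EPERM when the requested namespace is outside its scope and -EINVAL when NS_GET_PARENT is used on a nonhierarchical namespace. The returned descriptor must be closed by the caller.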

[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101

Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
outside of the init namespace, so we can return EPERM in this case too.
> The fewer special cases the easier the code is to get
> correct, and the easier it is to read. // Eric

Changes for v3:
* rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.

Cc: "Eric W. Biederman"
Cc: James Bottomley
Cc: "Michael Kerrisk (man-pages)"
Cc: "W. Trevor King"
Cc: Alexander Viro
Cc: Serge Hallyn

Eric W. Biederman
2016-09-23 09:00:36 +0800
bcac25a58 kernel: add a helper to get an owning user namespace for a namespace

Return -EPERM if an owning user namespace is outside of a process's
current user namespace.

v2: In the first version ns_get_owner returned ENOENT for init_user_ns.
This special case was removed from this version. There is nothing
outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.
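
The scope rule described in this message can be modelled in userspace as a walk up the parent chain. This is a simplified sketch, not the kernel helper itself: struct model_userns and model_owner_visible() are hypothetical names, and a NULL owner stands in for "whatever would own init_user_ns".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for a user namespace; only the parent link matters. */
struct model_userns {
	struct model_userns *parent;   /* NULL for the initial namespace */
};

/*
 * Return 0 if "owner" is the caller's user namespace or one of its
 * descendants, -EPERM otherwise.
 */
static int model_owner_visible(const struct model_userns *caller,
			       const struct model_userns *owner)
{
	const struct model_userns *p;

	assert(caller != NULL);
	for (p = owner; p != NULL; p = p->parent) {
		if (p == caller)
			return 0;
	}
	return -EPERM;   /* walked past the initial namespace without a match */
}
```

Because the walk simply runs off the top of the chain, the initial namespace needs no special case, which matches the v2 note about dropping ENOENT.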

Acked-by: Serge Hallyn
Signed-off-by: Andrei Vagin
Signed-off-by: Eric W. Biederman

Andrey Vagin
2016-09-23 08:59:39 +0800

31 Aug, 2016

1 commit

537f7ccb3 mntns: Add a limit on the number of mount namespaces.

v2: Fixed the very obvious lack of setting ucounts
on struct mnt_ns reported by Andrei Vagin, and the kbuild
test report.

Reported-by: Andrei Vagin
Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-08-31 20:28:35 +0800

09 Aug, 2016

9 commits

703286608 netns: Add a limit on the number of net namespaces

Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-08-09 03:42:04 +0800
d08311dd6 cgroupns: Add a limit on the number of cgroup namespaces

Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-08-09 03:42:03 +0800
aba356616 ipcns: Add a limit on the number of ipc namespaces

Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-08-09 03:42:03 +0800
f7af3d1c0 utsns: Add a limit on the number of uts namespaces

Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-08-09 03:42:02 +0800
f333c700c pidns: Add a limit on the number of pid namespaces

Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-08-09 03:42:01 +0800
25f9c0817 userns: Generalize the user namespace count into ucount

The same kind of recursive sane default limit and policy
control that has been implemented for the user namespace
is desirable for the other namespaces, so generalize
the user namespace reference count into a ucount.
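
To make the "recursive limit" idea concrete, here is a small userspace model. It assumes that a charge is applied at every level of the hierarchy and rolled back if any level is full; struct model_ucount and both function names are hypothetical, and plain ints replace the kernel's atomic counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One counter per (user, user namespace) pair, linked to the level above. */
struct model_ucount {
	struct model_ucount *parent;  /* counter charged in the parent namespace */
	int count;                    /* namespaces currently charged here */
	int max;                      /* limit configured at this level */
};

/* Charge one namespace at every level; undo everything if any level is full. */
static bool model_inc_ucount(struct model_ucount *uc)
{
	struct model_ucount *iter, *undo;

	for (iter = uc; iter != NULL; iter = iter->parent) {
		if (iter->count >= iter->max)
			goto fail;
		iter->count++;
	}
	return true;
fail:
	/* Roll back only the levels that were already incremented. */
	for (undo = uc; undo != iter; undo = undo->parent)
		undo->count--;
	return false;
}

/* Release a charge taken by a successful model_inc_ucount(). */
static void model_dec_ucount(struct model_ucount *uc)
{
	for (; uc != NULL; uc = uc->parent) {
		assert(uc->count > 0);
		uc->count--;
	}
}
```

With this shape a tight limit near the top of the hierarchy bounds everything created beneath it, no matter how generous the limits further down are.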

Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-08-09 03:41:52 +0800
f6b2db1a3 userns: Make the count of user namespaces per user

Add a structure that is per user and per user ns and use it to hold
the count of user namespaces. This prevents one user from denying
service to another user by creating the maximum number of user
namespaces.

Rename the sysctl export of the maximum count from
/proc/sys/userns/max_user_namespaces to /proc/sys/user/max_user_namespaces
to reflect that the count is now per user.

Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-08-09 03:40:30 +0800
b376c3e1b userns: Add a limit on the number of user namespaces

Export the maximum number of user namespaces as
/proc/sys/userns/max_user_namespaces.

Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-08-09 02:41:24 +0800
dbec28460 userns: Add per user namespace sysctls.

Limit per userns sysctls to only be opened for write by a holder
of CAP_SYS_RESOURCE.

Add all of the necessary boilerplate for having per user namespace
sysctls.

Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-08-09 02:18:58 +0800

08 Aug, 2016

1 commit

b032132c3 userns: Free user namespaces in process context

Add the necessary boilerplate to move freeing of user namespaces into
a work queue and thus into process context where things can sleep.

This is a necessary precursor to per user namespace sysctls.

Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-08-08 22:17:18 +0800

24 Jun, 2016

1 commit

d07b846f6 fs: Limit file caps to the user namespace of the super block

Capability sets attached to files must be ignored except in the
user namespaces where the mounter is privileged, i.e. s_user_ns
and its descendants. Otherwise a vector exists for gaining
privileges in namespaces where a user is not already privileged.

Add a new helper function, current_in_userns(), to test whether the
current user namespace is the same as or a descendant of another namespace.
Use this helper to determine whether a file's capability set
should be applied to the caps constructed during exec.
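
The descendant test described here can be sketched in userspace as follows. This is a model under stated assumptions, not the kernel code: struct sketch_userns and both function names are hypothetical, and the caller's namespace is passed explicitly instead of being taken from the running task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified user namespace: only the parent link is modelled. */
struct sketch_userns {
	struct sketch_userns *parent;   /* NULL for the initial namespace */
};

/* True if "current" is "target" itself or was created somewhere beneath it. */
static bool sketch_in_userns(const struct sketch_userns *current,
			     const struct sketch_userns *target)
{
	const struct sketch_userns *ns;

	assert(current != NULL);
	for (ns = current; ns != NULL; ns = ns->parent) {
		if (ns == target)
			return true;
	}
	return false;
}

/* File capabilities are honoured only when the caller is inside s_user_ns. */
static bool sketch_file_caps_apply(const struct sketch_userns *current,
				   const struct sketch_userns *s_user_ns)
{
	return sketch_in_userns(current, s_user_ns);
}
```

A process in the initial namespace therefore ignores file capabilities on a filesystem mounted from inside a container's user namespace, while processes in that container and in its nested namespaces honour them.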

--EWB Replaced in_userns with the simpler current_in_userns.

Acked-by: Serge Hallyn
Signed-off-by: Seth Forshee
Signed-off-by: Eric W. Biederman

Seth Forshee
2016-06-24 23:40:31 +0800

18 Dec, 2014

1 commit

87c31b39a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull user namespace related fixes from Eric Biederman:
"As these are bug fixes almost all of these changes are marked for
backporting to stable.

The first change (implicitly adding MNT_NODEV on remount) addresses a
regression that was created when security issues with unprivileged
remount were closed. I go on to update the remount test to make it
easy to detect if this issue reoccurs.

Then there are a handful of mount and umount related fixes.

Then half of the changes deal with a recently discovered design
bug in the permission checks of gid_map. Unix since the beginning has
allowed setting group permissions on files to less than the user and
other permissions (aka ---rwx---rwx). As the unix permission checks
stop as soon as a group matches, and setgroups allows setting groups
that can not later be dropped, this results in a situation where it is
possible to legitimately use a group to assign fewer privileges to a
process. This means dropping a group can increase a process's
privileges.
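
The effect described in this paragraph can be reproduced with a toy version of the classic owner/group/other check. This is an illustrative sketch only: perm_bits() is a hypothetical function, it ignores capabilities and ACLs, and the ids in the usage note are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Return the rwx bits (0-7) that the classic check grants to the caller. */
static unsigned int perm_bits(mode_t mode, uid_t file_uid, gid_t file_gid,
			      uid_t uid, const gid_t *groups, size_t ngroups)
{
	size_t i;

	assert(groups != NULL || ngroups == 0);
	if (uid == file_uid)
		return (mode >> 6) & 7;          /* owner class */
	for (i = 0; i < ngroups; i++) {
		if (groups[i] == file_gid)
			return (mode >> 3) & 7;  /* group class: the check stops here */
	}
	return mode & 7;                         /* other class */
}
```

For a file with mode 0707 and group 200, a process whose group list contains 200 gets no access, while the same process without group 200 falls through to the "other" class and gets full access. That is why being able to drop a group can raise privileges.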

The fix I have adopted is that gid_map is now no longer writable
without privilege unless the new file /proc/self/setgroups has been
set to permanently disable setgroups.

The bulk of applications using user namespaces, even the applications
using user namespaces without privilege, remain unaffected by this
change. Unfortunately this fix breaks a couple of user space
applications that were relying on the problematic behavior (one
of which was tools/selftests/mount/unprivileged-remount-test.c).

To hopefully prevent needing a regression fix on top of my security
fix I rounded up folks who work with the container implementations most
likely to be affected and encouraged them to test the changes.

> So far nothing broke on my libvirt-lxc test bed. :-)
> Tested with openSUSE 13.2 and libvirt 1.2.9.
> Tested-by: Richard Weinberger

> Tested on Fedora20 with libvirt 1.2.11, works fine.
> Tested-by: Chen Hanxiao

> Ok, thanks - yes, unprivileged lxc is working fine with your kernels.
> Just to be sure I was testing the right thing I also tested using
> my unprivileged nsexec testcases, and they failed on setgroup/setgid
> as now expected, and succeeded there without your patches.
> Tested-by: Serge Hallyn

> I tested this with Sandstorm. It breaks as is and it works if I add
> the setgroups thing.
> Tested-by: Andy Lutomirski # breaks things as designed :("

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Unbreak the unprivileged remount tests
userns; Correct the comment in map_write
userns: Allow setting gid_maps without privilege when setgroups is disabled
userns: Add a knob to disable setgroups on a per user namespace basis
userns: Rename id_map_mutex to userns_state_mutex
userns: Only allow the creator of the userns unprivileged mappings
userns: Check euid no fsuid when establishing an unprivileged uid mapping
userns: Don't allow unprivileged creation of gid mappings
userns: Don't allow setgroups until a gid mapping has been setablished
userns: Document what the invariant required for safe unprivileged mappings.
groups: Consolidate the setgroups permission checks
mnt: Clear mnt_expire during pivot_root
mnt: Carefully set CL_UNPRIVILEGED in clone_mnt
mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers.
umount: Do not allow unmounting rootfs.
umount: Disallow unprivileged mount force
mnt: Update unprivileged remount test
mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount

Linus Torvalds
2014-12-18 04:31:40 +0800

12 Dec, 2014

1 commit

9cc46516d userns: Add a knob to disable setgroups on a per user namespace basis

- Expose the knob to user space through a proc file /proc//setgroups

A value of "deny" means the setgroups system call is disabled in the
current process's user namespace and can not be enabled in the
future in this user namespace.

A value of "allow" means the setgroups system call is enabled.

- Descendant user namespaces inherit the value of setgroups from
their parents.

- A proc file is used (instead of a sysctl) as sysctls currently do
not allow checking the permissions at open time.

- Writing to the proc file is restricted to before the gid_map
for the user namespace is set.

This ensures that disabling setgroups at a user namespace
level will never remove the ability to call setgroups
from a process that already has that ability.
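
The rules in the list above can be summarised as a small state machine. The following is a userspace model, not the proc handler: struct knob_userns and the knob_* functions are hypothetical, and the specific errno values returned are assumptions of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct knob_userns {
	bool setgroups_allowed;  /* current value of the knob */
	bool gid_map_written;    /* true once gid_map has been set */
};

/* A child namespace starts with its parent's setting and no gid_map. */
static struct knob_userns knob_create_child(const struct knob_userns *parent)
{
	struct knob_userns child = { parent->setgroups_allowed, false };
	return child;
}

/* Model of a write to the setgroups file: returns 0 or a negative errno. */
static int knob_write(struct knob_userns *ns, const char *value)
{
	assert(ns != NULL && value != NULL);
	if (strcmp(value, "allow") == 0) {
		/* "allow" only confirms the current state; it never re-enables. */
		return ns->setgroups_allowed ? 0 : -EPERM;
	}
	if (strcmp(value, "deny") == 0) {
		if (ns->gid_map_written)
			return -EPERM;   /* too late: gid_map is already set */
		ns->setgroups_allowed = false;
		return 0;
	}
	return -EINVAL;
}
```

The one-way transition from "allow" to "deny", combined with the gid_map cutoff, is what guarantees that a process which can already call setgroups never loses that ability.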

A process may opt in to the setgroups disable for itself by
creating, entering and configuring a user namespace or by calling
setns on an existing user namespace with setgroups disabled.
Processes without privileges already can not call setgroups so this
is a noop. Processes with privilege become processes without
privilege when entering a user namespace and as with any other path
to dropping privilege they would not have the ability to call
setgroups. So this remains within the bounds of what is possible
without a knob to disable setgroups permanently in a user namespace.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2014-12-12 08:06:36 +0800

10 Dec, 2014

1 commit

273d2c67c userns: Don't allow setgroups until a gid mapping has been setablished

setgroups is unique in not needing a valid mapping before it can be called,
in the case of setgroups(0, NULL) which drops all supplemental groups.

The design of the user namespace assumes that CAP_SETGID can not actually
be used until a gid mapping is established. Therefore add a helper function
to see if the user namespace gid mapping has been established and call
that function in the setgroups permission check.

This is part of the fix for CVE-2014-8989, being able to drop groups
without privilege using user namespaces.

Cc: stable@vger.kernel.org
Reviewed-by: Andy Lutomirski
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2014-12-10 06:58:40 +0800

05 Dec, 2014

1 commit

435d5f4bb common object embedded into various struct ....ns

for now - just move corresponding ->proc_inum instances over there

Acked-by: "Eric W. Biederman"
Signed-off-by: Al Viro

Al Viro
2014-12-05 03:31:00 +0800

09 Aug, 2014

1 commit

ccf94f1b4 proc: constify seq_operations

proc_uid_seq_operations, proc_gid_seq_operations and
proc_projid_seq_operations are only called in proc_id_map_open with
seq_open as const struct seq_operations so we can constify the 3
structures and update proc_id_map_open prototype.

   text    data     bss     dec     hex filename
   6817     404    1984    9205    23f5 kernel/user_namespace.o-before
   6913     308    1984    9205    23f5 kernel/user_namespace.o-after

Signed-off-by: Fabian Frederick
Cc: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fabian Frederick
2014-08-09 06:57:22 +0800

24 Sep, 2013

1 commit

f36f8c75a KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches

Add support for per-user_namespace registers of persistent per-UID kerberos
caches held within the kernel.

This allows the kerberos cache to be retained beyond the life of all a user's
processes so that the user's cron jobs can work.

The kerberos cache is envisioned as a keyring/key tree looking something like:

struct user_namespace
  \___ .krb_cache keyring               - The register
        \___ _krb.0 keyring             - Root's Kerberos cache
        \___ _krb.5000 keyring          - User 5000's Kerberos cache
        \___ _krb.5001 keyring          - User 5001's Kerberos cache
                \___ tkt785 big_key     - A ccache blob
                \___ tkt12345 big_key   - Another ccache blob

Or possibly:

struct user_namespace
  \___ .krb_cache keyring               - The register
        \___ _krb.0 keyring             - Root's Kerberos cache
        \___ _krb.5000 keyring          - User 5000's Kerberos cache
        \___ _krb.5001 keyring          - User 5001's Kerberos cache
                \___ tkt785 keyring     - A ccache
                        \___ krbtgt/REDHAT.COM@REDHAT.COM big_key
                        \___ http/REDHAT.COM@REDHAT.COM user
                        \___ afs/REDHAT.COM@REDHAT.COM user
                        \___ nfs/REDHAT.COM@REDHAT.COM user
                        \___ krbtgt/KERNEL.ORG@KERNEL.ORG big_key
                        \___ http/KERNEL.ORG@KERNEL.ORG big_key

What goes into a particular Kerberos cache is entirely up to userspace. Kernel
support is limited to giving you the Kerberos cache keyring that you want.

The user asks for their Kerberos cache by:

krb_cache = keyctl_get_krbcache(uid, dest_keyring);

The uid is -1 or the user's own UID for the user's own cache or the uid of some
other user's cache (requires CAP_SETUID). This permits rpc.gssd or whatever to
mess with the cache.

The cache returned is a keyring named "_krb." that the possessor can read,
search, clear, invalidate, unlink from and add links to. Active LSMs get a
chance to rule on whether the caller is permitted to make a link.

Each uid's cache keyring is created when it is first accessed and is given a
timeout that is extended each time this function is called so that the keyring
goes away after a while. The timeout is configurable by sysctl but defaults to
three days.

Each user_namespace struct gets a lazily-created keyring that serves as the
register. The cache keyrings are added to it. This means that standard key
search and garbage collection facilities are available.

The user_namespace struct's register goes away when it does and anything left
in it is then automatically gc'd.

Signed-off-by: David Howells
Tested-by: Simo Sorce
cc: Serge E. Hallyn
cc: Eric W. Biederman

David Howells
2013-09-24 17:35:19 +0800

08 Sep, 2013

1 commit

c7c4591db Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull namespace changes from Eric Biederman:
"This is an assorted mishmash of small cleanups, enhancements and bug
fixes.

The major theme is user namespace mount restrictions. nsown_capable
is killed as it encourages not thinking about details that need to be
considered. A very hard to hit pid namespace exiting bug was finally
tracked and fixed. A couple of cleanups to the basic namespace
infrastructure.

Finally there is an enhancement that makes per user namespace
capabilities usable as capabilities, and an enhancement that allows
the per userns root to nice other processes in the user namespace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Kill nsown_capable it makes the wrong thing easy
capabilities: allow nice if we are privileged
pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
userns: Allow PR_CAPBSET_DROP in a user namespace.
namespaces: Simplify copy_namespaces so it is clear what is going on.
pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
sysfs: Restrict mounting sysfs
userns: Better restrictions on when proc and sysfs can be mounted
vfs: Don't copy mount bind mounts of /proc//ns/mnt between namespaces
kernel/nsproxy.c: Improving a snippet of code.
proc: Restrict mounting the proc filesystem
vfs: Lock in place mounts from more privileged users

Linus Torvalds
2013-09-08 05:35:32 +0800

27 Aug, 2013

1 commit

e51db7353 userns: Better restrictions on when proc and sysfs can be mounted

Rely on the fact that another flavor of the filesystem is already
mounted and do not rely on state in the user namespace.

Verify that the mounted filesystem is not covered in any significant
way. I would love to verify that the previously mounted filesystem
has no mounts on top but there are at least the directories
/proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
for other filesystems to mount on top of.

Refactor the test into a function named fs_fully_visible and call that
function from the mount routines of proc and sysfs. This makes this
test local to the filesystems involved and keeps the results current as
of when the mounts take place, removing a weird threading of the user
namespace, the mount namespace and the filesystems themselves.

Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-08-27 10:17:03 +0800

09 Aug, 2013

1 commit

8742f229b userns: limit the maximum depth of user_namespace->parent chain

Ensure that the user_namespace->parent chain can't grow too much.
Currently we use a hardcoded limit of 32.
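
A depth check of this kind can be sketched in userspace by storing a level in each namespace and refusing to create a child below the limit. The struct, the function name, the exact comparison and the EUSERS error code are all assumptions of this sketch; only the limit of 32 comes from the text.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define DEPTH_MAX_LEVEL 32   /* the hardcoded limit mentioned in the text */

struct depth_userns {
	struct depth_userns *parent;
	int level;               /* 0 for the initial namespace */
};

/* Create a child of "parent"; returns NULL and sets errno when too deep. */
static struct depth_userns *depth_create(struct depth_userns *parent)
{
	struct depth_userns *ns;

	assert(parent != NULL);
	if (parent->level > DEPTH_MAX_LEVEL) {
		errno = EUSERS;  /* error code chosen for this sketch */
		return NULL;
	}
	ns = malloc(sizeof(*ns));
	if (ns == NULL)
		return NULL;     /* allocation failure */
	ns->parent = parent;
	ns->level = parent->level + 1;
	return ns;
}
```

Storing the level at creation time keeps the check constant-time, so the kernel never has to walk the chain just to measure it.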

Reported-by: Andy Lutomirski
Signed-off-by: Oleg Nesterov
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-08-09 04:11:39 +0800

27 Mar, 2013

1 commit

87a8ebd63 userns: Restrict when proc and sysfs can be mounted

Only allow unprivileged mounts of proc and sysfs if they are already
mounted when the user namespace is created.

proc and sysfs are interesting because they have content that is
per namespace, and so fresh mounts are needed when new namespaces
are created while at the same time proc and sysfs have content that
is shared between every instance.

Respect the policy of who may see the shared content of proc and sysfs
by only allowing new mounts if there was an existing mount at the time
the user namespace was created.

In practice there are only two interesting cases: proc and sysfs are
mounted at their usual places, proc and sysfs are not mounted at all
(some form of mount namespace jail).

Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-03-27 22:50:08 +0800

27 Jan, 2013

1 commit

c61a2810a userns: Avoid recursion in put_user_ns

When freeing a deeply nested user namespace free_user_ns calls
put_user_ns on its parent which may in turn call free_user_ns again.
When -fno-optimize-sibling-calls is passed to gcc one stack frame per
user namespace is left on the stack, potentially overflowing the
kernel stack. CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
so we can't count on gcc to optimize this code.

Remove struct kref and use a plain atomic_t, making the code more
flexible and easier to comprehend. Make the loop in free_user_ns
explicit to guarantee that the stack does not overflow with
CONFIG_FRAME_POINTER enabled.
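
The explicit loop can be illustrated with a userspace model in which every child holds one reference on its parent. This is a sketch of the technique only: the struct and function names are hypothetical, a plain long replaces atomic_t, and malloc/free stand in for the kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>

struct ref_userns {
	struct ref_userns *parent;  /* each child holds one reference on its parent */
	long count;                 /* plain counter standing in for atomic_t */
};

static long ref_freed;          /* number of namespaces released so far */

static struct ref_userns *ref_create(struct ref_userns *parent)
{
	struct ref_userns *ns = malloc(sizeof(*ns));

	if (ns == NULL)
		return NULL;
	ns->parent = parent;
	ns->count = 1;              /* the creator's reference */
	if (parent != NULL)
		parent->count++;
	return ns;
}

/* Drop one reference; free and walk up with a loop instead of recursion. */
static void ref_put(struct ref_userns *ns)
{
	while (ns != NULL) {
		struct ref_userns *parent;

		assert(ns->count > 0);
		if (--ns->count != 0)
			break;          /* still referenced: stop here */
		parent = ns->parent;
		free(ns);
		ref_freed++;
		ns = parent;        /* now drop the reference the child held */
	}
}
```

Stack usage stays constant regardless of nesting depth, so the behaviour no longer depends on whether the compiler performs sibling-call optimization.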

I have tested this fix with a simple program that uses unshare to
create a deeply nested user namespace structure and then calls exit.
With 1000 nested user namespaces before this change running my test
program causes the kernel to die a horrible death. With 10,000,000
nested user namespaces after this change my test program runs to
completion and causes no harm.

Acked-by: Serge Hallyn
Pointed-out-by: Vasily Kulikov
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-01-27 14:11:41 +0800

20 Nov, 2012

2 commits

98f842e67 proc: Usable inode numbers for the namespace file descriptors.

Assign a unique proc inode to each namespace, and use that
inode number to ensure we only allocate at most one proc
inode for every namespace in proc.

A single proc inode per namespace allows userspace to test
to see if two processes are in the same namespace.
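
From userspace, the test mentioned here amounts to comparing what stat(2) reports for two namespace files. The sketch below is one way to do it; same_ns_file() is a hypothetical helper, and comparing st_dev in addition to st_ino is a precaution of this sketch, not something the commit message states.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Compare two namespace files such as /proc/1234/ns/net and
 * /proc/self/ns/net.  Returns 1 if they name the same object,
 * 0 if they differ, and -1 if either file cannot be examined.
 */
static int same_ns_file(const char *path_a, const char *path_b)
{
	struct stat a, b;

	assert(path_a != NULL && path_b != NULL);
	if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
		return -1;
	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

stat() follows the symbolic link in the ns directory, so the comparison is made on the namespace object itself and not on the per-process link.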

This has been a long requested feature and only blocked because
a naive implementation would put the id in a global space and
would ultimately require having a namespace for the names of
namespaces, making migration and certain virtualization tricks
impossible.

We still don't have per superblock inode numbers for proc, which
appears necessary for application unaware checkpoint/restart and
migrations (if the application is using namespace file descriptors)
but that is now allowed by the design if it becomes important.

I have preallocated the ipc and uts initial proc inode numbers so
their structures can be statically initialized.

Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-11-20 20:19:49 +0800
b2e0d9870 userns: Implement unshare of the user namespace

- Add CLONE_THREAD to the unshare flags if CLONE_NEWUSER is selected,
  as changing user namespaces is only valid if there is only
  a single thread.
- Restore the code to add CLONE_VM if CLONE_THREAD is selected and
  the code to add CLONE_SIGHAND if CLONE_VM is selected,
  making the constraints in the code clear.
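
The chain of implications in the two points above can be written as a small flag-expansion function. This is a sketch of the described rules, not the kernel's unshare code: implied_unshare_flags() is a hypothetical name, and the CLONE_* values are taken from <linux/sched.h>.

```c
#include <assert.h>
#include <linux/sched.h>   /* CLONE_* flag values */

/* Expand the flags following the implications described in the text. */
static unsigned long implied_unshare_flags(unsigned long flags)
{
	if (flags & CLONE_NEWUSER)
		flags |= CLONE_THREAD;   /* needs a single-threaded caller */
	if (flags & CLONE_THREAD)
		flags |= CLONE_VM;       /* threads share an address space */
	if (flags & CLONE_VM)
		flags |= CLONE_SIGHAND;  /* shared VM implies shared handlers */
	return flags;
}
```

The order of the three tests matters: each implication feeds the next, so CLONE_NEWUSER alone ends up requesting all three additional checks.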

Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2012-11-20 20:18:14 +0800

18 Sep, 2012

1 commit

f76d207a6 userns: Add kprojid_t and associated infrastructure in projid.h

Implement kprojid_t a cousin of the kuid_t and kgid_t.

The per user namespace mapping of project id values can be set with
/proc//projid_map.

A full complement of helpers is provided: make_kprojid, from_kprojid,
from_kprojid_munged, kprojid_has_mapping, projid_valid, projid_eq,
projid_lt.

Project identifiers are part of the generic disk quota interface,
although it appears only xfs implements project identifiers currently.

The xfs code allows anyone who has permission to set the project
identifier on a file to use any project identifier so when
setting up the user namespace project identifier mappings I do
not require a capability.

Cc: Dave Chinner
Cc: Jan Kara
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2012-09-18 16:01:37 +0800

03 May, 2012

2 commits

76b6db010 userns: Replace user_ns_map_uid and user_ns_map_gid with from_kuid and from_kgid

These functions are no longer needed; replace them with their more useful equivalents.

Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-05-03 18:28:39 +0800
078de5f70 userns: Store uid and gid values in struct cred with kuid_t and kgid_t types

cred.h and a few trivial users of struct cred are changed. The rest of the users
of struct cred are left for other patches as there are too many changes to make
in one go and leave the change reviewable. If user namespace support and
CONFIG_UIDGID_STRICT_TYPE_CHECKS are both disabled, the code will continue to compile
and behave correctly.

Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-05-03 18:28:38 +0800

26 Apr, 2012

2 commits

22d917d80 userns: Rework the user_namespace adding uid/gid mapping support

- Convert the old uid mapping functions into compatibility wrappers
- Add a uid/gid mapping layer from user space uid and gids to kernel
internal uids and gids that is extent based for simplicity and speed.
* Working with number space after mapping uids/gids into their kernel
internal version adds only mapping complexity over what we have today,
leaving the kernel code easy to understand and test.
- Add proc files /proc/self/uid_map /proc/self/gid_map
These files display the mapping and allow a mapping to be added
if a mapping does not exist.
- Allow entering the user namespace without a uid or gid mapping.
Since we are starting with an existing user our uids and gids
still have global mappings so are still valid and useful they just don't
have local mappings. The requirement for things to work are global uid
and gid so it is odd but perfectly fine not to have a local uid
and gid mapping.
Not requiring global uid and gid mappings greatly simplifies
the logic of setting up the uid and gid mappings by allowing
the mappings to be set after the namespace is created which makes the
slight weirdness worth it.
- Make the mappings in the initial user namespace to the global
uid/gid space explicit. Today it is an identity mapping
but in the future we may want to twist this for debugging, similar
to what we do with jiffies.
- Document the memory ordering requirements of setting the uid and
gid mappings. We only allow the mappings to be set once
and there are no pointers involved so the requirements are
trivial but a little atypical.
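
An extent-based mapping of the kind described in the list above can be sketched as an array of ranges searched linearly. This is a userspace model, not the kernel data structure: struct id_extent, the two functions and the ID_INVALID marker are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ID_INVALID ((uint32_t)-1)   /* stand-in for "no mapping" */

/* One range: ids [first, first + count) map to [lower_first, ...). */
struct id_extent {
	uint32_t first;        /* first id as seen inside the namespace */
	uint32_t lower_first;  /* first id in the kernel-internal space */
	uint32_t count;        /* length of the range */
};

/* Namespace id -> kernel-internal id. */
static uint32_t id_map_down(const struct id_extent *map, size_t n, uint32_t id)
{
	size_t i;

	assert(map != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (id >= map[i].first && id - map[i].first < map[i].count)
			return map[i].lower_first + (id - map[i].first);
	}
	return ID_INVALID;
}

/* Kernel-internal id -> namespace id. */
static uint32_t id_map_up(const struct id_extent *map, size_t n, uint32_t kid)
{
	size_t i;

	assert(map != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (kid >= map[i].lower_first &&
		    kid - map[i].lower_first < map[i].count)
			return map[i].first + (kid - map[i].lower_first);
	}
	return ID_INVALID;
}
```

With the single extent { 0, 100000, 65536 }, uid 1000 inside the namespace corresponds to 101000 outside it, and any id outside the range has no mapping in either direction.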

Performance:

In this scheme for the permission checks the performance is expected to
stay the same as the actual machine instructions should remain the same.

The worst case I could think of is ls -l on a large directory where
all of the stat results need to be translated from kuids and
kgids to uids and gids. So I benchmarked that case on my laptop
with a dual core hyperthread Intel i5-2520M cpu with 3M of cpu cache.

My benchmark consisted of going to single user mode where nothing else
was running. On an ext4 filesystem I opened 1,000,000 files, looped
through all of the files 1000 times, and called fstat on the
individual files. This was to ensure I was benchmarking stat times
where the inodes were in the kernel's cache, but the inode values were
not in the processor's cache. My results:

v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)

All of the configurations ran in roughly 120ns when I performed tests
that ran in the cpu cache.

So in summary the performance impact is:
1ns improvement in the worst case with user namespace support compiled out.
8ns aka 5% slowdown in the worst case with user namespace support compiled in.

Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-04-26 17:01:39 +0800
783291e69 userns: Simplify the user_namespace by making userns->creator a kuid.

- Transform userns->creator from a user_struct reference to a simple
kuid_t, kgid_t pair.

In cap_capable this allows the check to see if we are the creator of
a namespace to become the classic suser style euid permission check.

This allows us to remove the need for a struct cred in the mapping
functions and still be able to display the user namespace creator's
uid and gid as 0.

- Remove the now unnecessary delayed_work in free_user_ns.

All that is left for free_user_ns to do is to call kmem_cache_free
and put_user_ns. Those functions can be called in any context
so call them directly from free_user_ns removing the need for delayed work.

Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-04-26 17:00:59 +0800

08 Apr, 2012

2 commits

7b44ab978 userns: Disassociate user_struct from the user_namespace.

Modify alloc_uid to take a kuid and make the user hash table global.
Stop holding a reference to the user namespace in struct user_struct.

This simplifies the code and makes the per user accounting not
care about which user namespace a uid happens to appear in.

Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-04-08 08:11:46 +0800
aeb3ae9da userns: Add an explicit reference to the parent user namespace

I am about to remove the struct user_namespace reference from struct user_struct.
So keep an explicit track of the parent user namespace.

Take advantage of this new reference and replace instances of user_ns->creator->user_ns
with user_ns->parent.

Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-04-08 07:55:52 +0800

14 Jan, 2011

1 commit

6164281ab user_ns: improve the user_ns on-the-slab packaging

Currently on 64-bit arch the user_namespace is 2096 bytes and when being
kmalloc-ed it resides on a 4k slab wasting 2003 bytes.

If we allocate a separate cache for it and reduce the hash size from 128
to 64 chains the packaging becomes *much* better - the struct is 1072
bytes and the hole between is 98 bytes.

[akpm@linux-foundation.org: s/__initcall/module_init/]
Signed-off-by: Pavel Emelyanov
Acked-by: Serge E. Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2011-01-14 00:03:18 +0800

17 Jun, 2010

1 commit

5c1469de7 user_ns: Introduce user_nsmap_uid and user_ns_map_gid.

Define what happens when we view a uid from one user_namespace
in another user_namespace.

- If the user namespaces are the same no mapping is necessary.

- For most cases of difference use overflowuid and overflowgid,
the uid and gid currently used for 16bit apis when we have a 32bit uid
that does not fit in 16bits. Effectively the situation is the same,
we want to return a uid or gid that is not assigned to any user.

- For the case when we happen to be mapping the uid or gid of the
creator of the target user namespace use uid 0 and gid 0, as confusing
that user with root is not a problem.
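
The three rules above can be condensed into a small userspace model. This sketch simplifies the real situation: struct view_userns and view_uid() are hypothetical, only the direct creator of the target namespace is considered, and 65534 is assumed as the overflow uid.

```c
#include <assert.h>
#include <stddef.h>

#define VIEW_OVERFLOWUID 65534u   /* assumed default overflow uid */

struct view_userns {
	unsigned int creator_uid;              /* uid that created this namespace */
	const struct view_userns *creator_ns;  /* namespace the creator lives in */
};

/* What "uid" from namespace "from" looks like when viewed from "to". */
static unsigned int view_uid(const struct view_userns *to,
			     const struct view_userns *from, unsigned int uid)
{
	assert(to != NULL && from != NULL);
	if (to == from)
		return uid;                /* same namespace: no mapping needed */
	if (to->creator_ns == from && to->creator_uid == uid)
		return 0;                  /* creator of the target looks like root */
	return VIEW_OVERFLOWUID;           /* unrelated: a uid assigned to nobody */
}
```

The same shape applies to gids with the overflow gid in place of the overflow uid.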

Signed-off-by: Eric W. Biederman
Acked-by: Serge E. Hallyn
Signed-off-by: David S. Miller

Eric W. Biederman
2010-06-17 05:55:34 +0800

28 Feb, 2009

1 commit

517083667 Fix recursive lock in free_uid()/free_user_ns()

free_uid() and free_user_ns() are corecursive when CONFIG_USER_SCHED=n,
but free_user_ns() is called from free_uid() by way of uid_hash_remove(),
which requires uidhash_lock to be held. free_user_ns() then calls
free_uid() to complete the destruction.

Fix this by deferring the destruction of the user_namespace.

Signed-off-by: David Howells
Acked-by: Serge Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Howells
2009-02-28 08:26:21 +0800

25 Nov, 2008

1 commit

18b6e0414 User namespaces: set of cleanups (v2)

The user_ns is moved from nsproxy to user_struct, so that a struct
cred by itself is sufficient to determine access (which it otherwise
would not be). Corresponding ecryptfs fixes (by David Howells) are
here as well.

Fix refcounting. The following rules now apply:
1. The task pins the user struct.
2. The user struct pins its user namespace.
3. The user namespace pins the struct user which created it.

User namespaces are cloned during copy_creds(). Unsharing a new user_ns
is no longer possible. (We could re-add that, but it'll cause code
duplication and doesn't seem useful if PAM doesn't need to clone user
namespaces).

When a user namespace is created, its first user (uid 0) gets empty
keyrings and a clean group_info.

This incorporates a previous patch by David Howells. Here
is his original patch description:

>I suggest adding the attached incremental patch. It makes the following
>changes:
>
> (1) Provides a current_user_ns() macro to wrap accesses to current's user
> namespace.
>
> (2) Fixes eCryptFS.
>
> (3) Renames create_new_userns() to create_user_ns() to be more consistent
> with the other associated functions and because the 'new' in the name is
> superfluous.
>
> (4) Moves the argument and permission checks made for CLONE_NEWUSER to the
> beginning of do_fork() so that they're done prior to making any attempts
> at allocation.
>
> (5) Calls create_user_ns() after prepare_creds(), and gives it the new creds
> to fill in rather than have it return the new root user. I don't imagine
> the new root user being used for anything other than filling in a cred
> struct.
>
> This also permits me to get rid of a get_uid() and a free_uid(), as the
> reference the creds were holding on the old user_struct can just be
> transferred to the new namespace's creator pointer.
>
> (6) Makes create_user_ns() reset the UIDs and GIDs of the creds under
> preparation rather than doing it in copy_creds().
>
>David

>Signed-off-by: David Howells

Changelog:
Oct 20: integrate dhowells comments
1. leave thread_keyring alone
2. use current_user_ns() in set_user()

Signed-off-by: Serge Hallyn

Serge Hallyn
2008-11-25 07:57:41 +0800

20 Sep, 2007

1 commit

735de2230 Convert uid hash to hlist

Surprisingly (spotted by Alexey Dobriyan), the uid hash still uses
list_heads, thus occupying twice as much space as it could. Convert it to
hlist_heads.

Signed-off-by: Pavel Emelyanov
Signed-off-by: Alexey Dobriyan
Acked-by: Serge Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-09-20 02:24:18 +0800

20 Jul, 2007

1 commit

626ac545c user namespace: fix copy_user_ns return value

When CONFIG_USER_NS=n and a user tries to unshare some namespace other
than the user namespace, the dummy copy_user_ns returns NULL rather than
the old_ns.

This value then gets assigned to task->nsproxy->user_ns, so that a
subsequent setuid, which uses task->nsproxy->user_ns, causes a NULL
pointer deref.

Fix this by returning old_ns.
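
The shape of the fix can be shown with a userspace stand-in for the compiled-out stub. The names struct stub_userns and stub_copy_user_ns() are hypothetical; the point is only that a no-op copy must hand back the namespace it was given.

```c
#include <assert.h>
#include <stddef.h>

struct stub_userns {
	int unused;   /* placeholder member; contents are irrelevant here */
};

/* Stub used when the feature is compiled out: return the old namespace. */
static struct stub_userns *stub_copy_user_ns(int flags,
					     struct stub_userns *old_ns)
{
	(void)flags;      /* nothing to unshare without user namespace support */
	assert(old_ns != NULL);
	return old_ns;    /* returning NULL here is the bug described above */
}
```

Callers that store the result unconditionally then keep a valid pointer, so a later dereference is safe.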

Signed-off-by: Serge E. Hallyn
Signed-off-by: Linus Torvalds

Serge E. Hallyn
2007-07-20 05:05:08 +0800