14 Jan, 2020
1 commit
-
Time Namespace isolates clock values.
The kernel provides access to several clocks CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.CLOCK_REALTIME
System-wide clock that measures real (i.e., wall-clock) time.CLOCK_MONOTONIC
Clock that cannot be set and represents monotonic time since
some unspecified starting point.CLOCK_BOOTTIME
Identical to CLOCK_MONOTONIC, except it also includes any time
that the system is suspended.For many users, the time namespace means the ability to changes date and
time in a container (CLOCK_REALTIME). Providing per namespace notions of
CLOCK_REALTIME would be complex with a massive overhead, but has a dubious
value.But in the context of checkpoint/restore functionality, monotonic and
boottime clocks become interesting. Both clocks are monotonic with
unspecified starting points. These clocks are widely used to measure time
slices and set timers. After restoring or migrating processes, it has to be
guaranteed that they never go backward. In an ideal case, the behavior of
these clocks should be the same as for a case when a whole system is
suspended. All this means that it is required to set CLOCK_MONOTONIC and
CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace
offsets for clocks.A time namespace is similar to a pid namespace in the way how it is
created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
but doesn't set it to the current process. Then all children of the process
will be born in the new time namespace, or a process can use the setns()
system call to join a namespace.This scheme allows setting clock offsets for a namespace, before any
processes appear in it.All available clone flags have been used, so CLONE_NEWTIME uses the highest
bit of CSIGNAL. It means that it can be used only with the unshare() and
the clone3() system calls.[ tglx: Adjusted paragraph about clone3() to reality and massaged the
changelog a bit. ]Co-developed-by: Dmitry Safonov
Signed-off-by: Andrei Vagin
Signed-off-by: Dmitry Safonov
Signed-off-by: Thomas Gleixner
Link: https://criu.org/Time_namespace
Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
Link: https://lore.kernel.org/r/20191112012724.250792-4-dima@arista.com
27 Jun, 2019
2 commits
-
Move the user and user-session keyrings to the user_namespace struct rather
than pinning them from the user_struct struct. This prevents these
keyrings from propagating across user-namespaces boundaries with regard to
the KEY_SPEC_* flags, thereby making them more useful in a containerised
environment.The issue is that a single user_struct may be represent UIDs in several
different namespaces.The way the patch does this is by attaching a 'register keyring' in each
user_namespace and then sticking the user and user-session keyrings into
that. It can then be searched to retrieve them.Signed-off-by: David Howells
cc: Jann Horn -
Keyring names are held in a single global list that any process can pick
from by means of keyctl_join_session_keyring (provided the keyring grants
Search permission). This isn't very container friendly, however.Make the following changes:
(1) Make default session, process and thread keyring names begin with a
'.' instead of '_'.(2) Keyrings whose names begin with a '.' aren't added to the list. Such
keyrings are system specials.(3) Replace the global list with per-user_namespace lists. A keyring adds
its name to the list for the user_namespace that it is currently in.(4) When a user_namespace is deleted, it just removes itself from the
keyring name list.The global keyring_name_lock is retained for accessing the name lists.
This allows (4) to work.This can be tested by:
# keyctl newring foo @s
995906392
# unshare -U
$ keyctl show
...
995906392 --alswrv 65534 65534 \_ keyring: foo
...
$ keyctl session foo
Joined session keyring: 935622349As can be seen, a new session keyring was created.
The capability bit KEYCTL_CAPS1_NS_KEYRING_NAME is set if the kernel is
employing this feature.Signed-off-by: David Howells
cc: Eric W. Biederman
17 Nov, 2017
1 commit
-
Pull user namespace update from Eric Biederman:
"The only change that is production ready this round is the work to
increase the number of uid and gid mappings a user namespace can
support from 5 to 340.This code was carefully benchmarked and it was confirmed that in the
existing cases the performance remains the same. In the worst case
with 340 mappings an cache cold stat times go from 158ns to 248ns.
That is noticable but still quite small, and only the people who are
doing crazy things pay the cost.This work uncovered some documentation and cleanup opportunities in
the mapping code, and patches to make those cleanups and improve the
documentation will be coming in the next merge window"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Simplify insert_extent
userns: Make map_id_down a wrapper for map_id_range_down
userns: Don't read extents twice in m_start
userns: Simplify the user and group mapping functions
userns: Don't special case a count of 0
userns: bump idmap limits to 340
userns: use union in {g,u}idmap struct
02 Nov, 2017
1 commit
-
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.By default all files without license information are under the default
license of the kernel, which is GPL version 2.Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
01 Nov, 2017
2 commits
-
There are quite some use cases where users run into the current limit for
{g,u}id mappings. Consider a user requesting us to map everything but 999, and
1001 for a given range of 1000000000 with a sub{g,u}id layout of:some-user:100000:1000000000
some-user:999:1
some-user:1000:1
some-user:1001:1
some-user:1002:1This translates to:
MAPPING-TYPE | CONTAINER | HOST | RANGE |
-------------|-----------|---------|-----------|
uid | 999 | 999 | 1 |
uid | 1001 | 1001 | 1 |
uid | 0 | 1000000 | 999 |
uid | 1000 | 1001000 | 1 |
uid | 1002 | 1001002 | 999998998 |
------------------------------------------------
gid | 999 | 999 | 1 |
gid | 1001 | 1001 | 1 |
gid | 0 | 1000000 | 999 |
gid | 1000 | 1001000 | 1 |
gid | 1002 | 1001002 | 999998998 |which is already the current limit.
As discussed at LPC simply bumping the number of limits is not going to work
since this would mean that struct uid_gid_map won't fit into a single cache-line
anymore thereby regressing performance for the base-cases. The same problem
seems to arise when using a single pointer. So the idea is to usestruct uid_gid_extent {
u32 first;
u32 lower_first;
u32 count;
};struct uid_gid_map { /* 64 bytes -- 1 cache line */
u32 nr_extents;
union {
struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
struct {
struct uid_gid_extent *forward;
struct uid_gid_extent *reverse;
};
};
};For the base cases we will only use the struct uid_gid_extent extent member. If
we go over UID_GID_MAP_MAX_BASE_EXTENTS mappings we perform a single 4k
kmalloc() which means we can have a maximum of 340 mappings
(340 * size(struct uid_gid_extent) = 4080). For the latter case we use two
pointers "forward" and "reverse". The forward pointer points to an array sorted
by "first" and the reverse pointer points to an array sorted by "lower_first".
We can then perform binary search on those arrays.Performance Testing:
When Eric introduced the extent-based struct uid_gid_map approach he measured
the performanc impact of his idmap changes:> My benchmark consisted of going to single user mode where nothing else was
> running. On an ext4 filesystem opening 1,000,000 files and looping through all
> of the files 1000 times and calling fstat on the individuals files. This was
> to ensure I was benchmarking stat times where the inodes were in the kernels
> cache, but the inode values were not in the processors cache. My results:> v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
> v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
> v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)I used an identical approach on my laptop. Here's a thorough description of what
I did. I built a 4.14.0-rc4 mainline kernel with my new idmap patches applied. I
booted into single user mode and used an ext4 filesystem to open/create
1,000,000 files. Then I looped through all of the files calling fstat() on each
of them 1000 times and calculated the mean fstat() time for a single file. (The
test program can be found below.)Here are the results. For fun, I compared the first version of my patch which
scaled linearly with the new version of the patch:| # MAPPINGS | PATCH-V1 | PATCH-NEW |
|--------------|------------|-----------|
| 0 mappings | 158 ns | 158 ns |
| 1 mappings | 164 ns | 157 ns |
| 2 mappings | 170 ns | 158 ns |
| 3 mappings | 175 ns | 161 ns |
| 5 mappings | 187 ns | 165 ns |
| 10 mappings | 218 ns | 199 ns |
| 50 mappings | 528 ns | 218 ns |
| 100 mappings | 980 ns | 229 ns |
| 200 mappings | 1880 ns | 239 ns |
| 300 mappings | 2760 ns | 240 ns |
| 340 mappings | not tested | 248 ns |Here's the test program I used. I asked Eric what he did and this is a more
"advanced" implementation of the idea. It's pretty straight-forward:#define __GNU_SOURCE
#define __STDC_FORMAT_MACROS
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#includeint main(int argc, char *argv[])
{
int ret;
size_t i, k;
int fd[1000000];
int times[1000];
char pathname[4096];
struct stat st;
struct timeval t1, t2;
uint64_t time_in_mcs;
uint64_t sum = 0;if (argc != 2) {
fprintf(stderr, "Please specify a directory where to create "
"the test files\n");
exit(EXIT_FAILURE);
}for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
sprintf(pathname, "%s/idmap_test_%zu", argv[1], i);
fd[i]= open(pathname, O_RDWR | O_CREAT, S_IXUSR | S_IXGRP | S_IXOTH);
if (fd[i] < 0) {
ssize_t j;
for (j = i; j >= 0; j--)
close(fd[j]);
exit(EXIT_FAILURE);
}
}for (k = 0; k < 1000; k++) {
ret = gettimeofday(&t1, NULL);
if (ret < 0)
goto close_all;for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
ret = fstat(fd[i], &st);
if (ret < 0)
goto close_all;
}ret = gettimeofday(&t2, NULL);
if (ret < 0)
goto close_all;time_in_mcs = (1000000 * t2.tv_sec + t2.tv_usec) -
(1000000 * t1.tv_sec + t1.tv_usec);
printf("Total time in micro seconds: %" PRIu64 "\n",
time_in_mcs);
printf("Total time in nanoseconds: %" PRIu64 "\n",
time_in_mcs * 1000);
printf("Time per file in nanoseconds: %" PRIu64 "\n",
(time_in_mcs * 1000) / 1000000);
times[k] = (time_in_mcs * 1000) / 1000000;
}close_all:
for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++)
close(fd[i]);if (ret < 0)
exit(EXIT_FAILURE);for (k = 0; k < 1000; k++) {
sum += times[k];
}printf("Mean time per file in nanoseconds: %" PRIu64 "\n", sum / 1000);
exit(EXIT_SUCCESS);;
}Signed-off-by: Christian Brauner
CC: Serge Hallyn
CC: Eric Biederman
Signed-off-by: Eric W. Biederman -
- Add a struct containing two pointer to extents and wrap both the static extent
array and the struct into a union. This is done in preparation for bumping the
{g,u}idmap limits for user namespaces.
- Add brackets around anonymous union when using designated initializers to
initialize members in order to please gcc
Signed-off-by: Eric W. Biederman
12 Sep, 2017
1 commit
-
Pull namespace updates from Eric Biederman:
"Life has been busy and I have not gotten half as much done this round
as I would have liked. I delayed it so that a minor conflict
resolution with the mips tree could spend a little time in linux-next
before I sent this pull request.This includes two long delayed user namespace changes from Kirill
Tkhai. It also includes a very useful change from Serge Hallyn that
allows the security capability attribute to be used inside of user
namespaces. The practical effect of this is people can now untar
tarballs and install rpms in user namespaces. It had been suggested to
generalize this and encode some of the namespace information
information in the xattr name. Upon close inspection that makes the
things that should be hard easy and the things that should be easy
more expensive.Then there is my bugfix/cleanup for signal injection that removes the
magic encoding of the siginfo union member from the kernel internal
si_code. The mips folks reported the case where I had used FPE_FIXME
me is impossible so I have remove FPE_FIXME from mips, while at the
same time including a return statement in that case to keep gcc from
complaining about unitialized variables.I almost finished the work to get make copy_siginfo_to_user a trivial
copy to user. The code is available at:git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git neuter-copy_siginfo_to_user-v3
But I did not have time/energy to get the code posted and reviewed
before the merge window opened.I was able to see that the security excuse for just copying fields
that we know are initialized doesn't work in practice there are buggy
initializations that don't initialize the proper fields in siginfo. So
we still sometimes copy unitialized data to userspace"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
Introduce v3 namespaced file capabilities
mips/signal: In force_fcr31_sig return in the impossible case
signal: Remove kernel interal si_code magic
fcntl: Don't use ambiguous SIG_POLL si_codes
prctl: Allow local CAP_SYS_ADMIN changing exe_file
security: Use user_namespace::level to avoid redundant iterations in cap_capable()
userns,pidns: Verify the userns for new pid namespaces
signal/testing: Don't look for __SI_FAULT in userspace
signal/mips: Document a conflict with SI_USER with SIGFPE
signal/sparc: Document a conflict with SI_USER with SIGFPE
signal/ia64: Document a conflict with SI_USER with SIGFPE
signal/alpha: Document a conflict with SI_USER for SIGTRAP
20 Jul, 2017
1 commit
-
It is pointless and confusing to allow a pid namespace hierarchy and
the user namespace hierarchy to get out of sync. The owner of a child
pid namespace should be the owner of the parent pid namespace or
a descendant of the owner of the parent pid namespace.Otherwise it is possible to construct scenarios where a process has a
capability over a parent pid namespace but does not have the
capability over a child pid namespace. Which confusingly makes
permission checks non-transitive.It requires use of setns into a pid namespace (but not into a user
namespace) to create such a scenario.Add the function in_userns to help in making this determination.
v2: Optimized in_userns by using level as suggested
by: Kirill TkhaiRef: 49f4d8b93ccf ("pidns: Capture the user namespace and filter ns_last_pid")
Signed-off-by: "Eric W. Biederman"
01 Jul, 2017
1 commit
-
This marks many critical kernel structures for randomization. These are
structures that have been targeted in the past in security exploits, or
contain functions pointers, pointers to function pointer tables, lists,
workqueues, ref-counters, credentials, permissions, or are otherwise
sensitive. This initial list was extracted from Brad Spengler/PaX Team's
code in the last public patch of grsecurity/PaX based on my understanding
of the code. Changes or omissions from the original code are mine and
don't reflect the original grsecurity/PaX code.Left out of this list is task_struct, which requires special handling
and will be covered in a subsequent patch.Signed-off-by: Kees Cook
07 Mar, 2017
1 commit
-
Always increment/decrement ucount->count under the ucounts_lock. The
increments are there already and moving the decrements there means the
locking logic of the code is simpler. This simplification in the
locking logic fixes a race between put_ucounts and get_ucounts that
could result in a use-after-free because the count could go zero then
be found by get_ucounts and then be freed by put_ucounts.A bug presumably this one was found by a combination of syzkaller and
KASAN. JongWhan Kim reported the syzkaller failure and Dmitry Vyukov
spotted the race in the code.Cc: stable@vger.kernel.org
Fixes: f6b2db1a3e8d ("userns: Make the count of user namespaces per user")
Reported-by: JongHwan Kim
Reported-by: Dmitry Vyukov
Reviewed-by: Andrei Vagin
Signed-off-by: "Eric W. Biederman"
03 Mar, 2017
2 commits
-
So we want to simplify 's header dependencies, but one
roadblock of that is 's inclusion of sysctl.h,
which brings in other, problematic headers.Note that timer.h's inclusion of sysctl.h can be avoided if we
pre-declare ctl_table - so do that.Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Signed-off-by: Ingo Molnar -
This is a stray header that is not needed by anything in sched.h,
so remove it.Update files that relied on the stray inclusion.
This reduces the size of the header dependency graph.
Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Signed-off-by: Ingo Molnar
02 Mar, 2017
1 commit
-
We are going to remove the following header inclusions from :
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#includeFix up a single .h file that got hold of via one of these headers.
Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar
24 Jan, 2017
1 commit
-
This patchset converts inotify to using the newly introduced
per-userns sysctl infrastructure.Currently the inotify instances/watches are being accounted in the
user_struct structure. This means that in setups where multiple
users in unprivileged containers map to the same underlying
real user (i.e. pointing to the same user_struct) the inotify limits
are going to be shared as well, allowing one user(or application) to exhaust
all others limits.Fix this by switching the inotify sysctls to using the
per-namespace/per-user limits. This will allow the server admin to
set sensible global limits, which can further be tuned inside every
individual user namespace. Additionally, in order to preserve the
sysctl ABI make the existing inotify instances/watches sysctls
modify the values of the initial user namespace.Signed-off-by: Nikolay Borisov
Acked-by: Jan Kara
Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman
23 Sep, 2016
2 commits
-
From: Andrey Vagin
Each namespace has an owning user namespace and now there is not way
to discover these relationships.Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships too.Why we may want to know relationships between namespaces?
One use would be visualization, in order to understand the running
system. Another would be to answer the question: what capability does
process X have to perform operations on a resource governed by namespace
Y?One more use-case (which usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.There [1] was a discussion about which interface to choose to determing
relationships between namespaces.Eric suggested to add two ioctl-s [2]:
> Grumble, Grumble. I think this may actually a case for creating ioctls
> for these two cases. Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.Here is an implementaions of these ioctl-s.
$ man man7/namespaces.7
...
Since Linux 4.X, the following ioctl(2) calls are supported for
namespace file descriptors. The correct syntax is:fd = ioctl(ns_fd, ioctl_type);
where ioctl_type is one of the following:
NS_GET_USERNS
Returns a file descriptor that refers to an owning user names‐
pace.NS_GET_PARENT
Returns a file descriptor that refers to a parent namespace.
This ioctl(2) can be used for pid and user namespaces. For
user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
meaning.In addition to generic ioctl(2) errors, the following specific ones
can occur:EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.
EPERM The requested namespace is outside of the current namespace
scope.[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
outside of the init namespace, so we can return EPERM in this case too.
> The fewer special cases the easier the code is to get
> correct, and the easier it is to read. // EricChanges for v3:
* rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.Cc: "Eric W. Biederman"
Cc: James Bottomley
Cc: "Michael Kerrisk (man-pages)"
Cc: "W. Trevor King"
Cc: Alexander Viro
Cc: Serge Hallyn -
Return -EPERM if an owning user namespace is outside of a process
current user namespace.v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
This special cases was removed from this version. There is nothing
outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.Acked-by: Serge Hallyn
Signed-off-by: Andrei Vagin
Signed-off-by: Eric W. Biederman
31 Aug, 2016
1 commit
-
v2: Fixed the very obvious lack of setting ucounts
on struct mnt_ns reported by Andrei Vagin, and the kbuild
test report.Reported-by: Andrei Vagin
Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman"
09 Aug, 2016
9 commits
-
Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman" -
Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman" -
Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman" -
Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman" -
Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman" -
The same kind of recursive sane default limit and policy
countrol that has been implemented for the user namespace
is desirable for the other namespaces, so generalize
the user namespace refernce count into a ucount.Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman" -
Add a structure that is per user and per user ns and use it to hold
the count of user namespaces. This makes prevents one user from
creating denying service to another user by creating the maximum
number of user namespaces.Rename the sysctl export of the maximum count from
/proc/sys/userns/max_user_namespaces to /proc/sys/user/max_user_namespaces
to reflect that the count is now per user.Signed-off-by: "Eric W. Biederman"
-
Export the export the maximum number of user namespaces as
/proc/sys/userns/max_user_namespaces.Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman" -
Limit per userns sysctls to only be opened for write by a holder
of CAP_SYS_RESOURCE.Add all of the necessary boilerplate for having per user namespace
sysctls.Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman"
08 Aug, 2016
1 commit
-
Add the necessary boiler plate to move freeing of user namespaces into
work queue and thus into process context where things can sleep.This is a necessary precursor to per user namespace sysctls.
Signed-off-by: "Eric W. Biederman"
24 Jun, 2016
1 commit
-
Capability sets attached to files must be ignored except in the
user namespaces where the mounter is privileged, i.e. s_user_ns
and its descendants. Otherwise a vector exists for gaining
privileges in namespaces where a user is not already privileged.Add a new helper function, current_in_user_ns(), to test whether a user
namespace is the same as or a descendant of another namespace.
Use this helper to determine whether a file's capability set
should be applied to the caps constructed during exec.--EWB Replaced in_userns with the simpler current_in_userns.
Acked-by: Serge Hallyn
Signed-off-by: Seth Forshee
Signed-off-by: Eric W. Biederman
18 Dec, 2014
1 commit
-
Pull user namespace related fixes from Eric Biederman:
"As these are bug fixes almost all of thes changes are marked for
backporting to stable.The first change (implicitly adding MNT_NODEV on remount) addresses a
regression that was created when security issues with unprivileged
remount were closed. I go on to update the remount test to make it
easy to detect if this issue reoccurs.Then there are a handful of mount and umount related fixes.
Then half of the changes deal with the a recently discovered design
bug in the permission checks of gid_map. Unix since the beginning has
allowed setting group permissions on files to less than the user and
other permissions (aka ---rwx---rwx). As the unix permission checks
stop as soon as a group matches, and setgroups allows setting groups
that can not later be dropped, results in a situtation where it is
possible to legitimately use a group to assign fewer privileges to a
process. Which means dropping a group can increase a processes
privileges.The fix I have adopted is that gid_map is now no longer writable
without privilege unless the new file /proc/self/setgroups has been
set to permanently disable setgroups.The bulk of user namespace using applications even the applications
using applications using user namespaces without privilege remain
unaffected by this change. Unfortunately this ix breaks a couple user
space applications, that were relying on the problematic behavior (one
of which was tools/selftests/mount/unprivileged-remount-test.c).To hopefully prevent needing a regression fix on top of my security
fix I rounded folks who work with the container implementations mostly
like to be affected and encouraged them to test the changes.> So far nothing broke on my libvirt-lxc test bed. :-)
> Tested with openSUSE 13.2 and libvirt 1.2.9.
> Tested-by: Richard Weinberger> Tested on Fedora20 with libvirt 1.2.11, works fine.
> Tested-by: Chen Hanxiao> Ok, thanks - yes, unprivileged lxc is working fine with your kernels.
> Just to be sure I was testing the right thing I also tested using
> my unprivileged nsexec testcases, and they failed on setgroup/setgid
> as now expected, and succeeded there without your patches.
> Tested-by: Serge Hallyn> I tested this with Sandstorm. It breaks as is and it works if I add
> the setgroups thing.
> Tested-by: Andy Lutomirski # breaks things as designed :("* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Unbreak the unprivileged remount tests
userns; Correct the comment in map_write
userns: Allow setting gid_maps without privilege when setgroups is disabled
userns: Add a knob to disable setgroups on a per user namespace basis
userns: Rename id_map_mutex to userns_state_mutex
userns: Only allow the creator of the userns unprivileged mappings
userns: Check euid no fsuid when establishing an unprivileged uid mapping
userns: Don't allow unprivileged creation of gid mappings
userns: Don't allow setgroups until a gid mapping has been setablished
userns: Document what the invariant required for safe unprivileged mappings.
groups: Consolidate the setgroups permission checks
mnt: Clear mnt_expire during pivot_root
mnt: Carefully set CL_UNPRIVILEGED in clone_mnt
mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers.
umount: Do not allow unmounting rootfs.
umount: Disallow unprivileged mount force
mnt: Update unprivileged remount test
mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount
12 Dec, 2014
1 commit
-
- Expose the knob to user space through a proc file /proc//setgroups
A value of "deny" means the setgroups system call is disabled in the
current processes user namespace and can not be enabled in the
future in this user namespace.A value of "allow" means the segtoups system call is enabled.
- Descendant user namespaces inherit the value of setgroups from
their parents.- A proc file is used (instead of a sysctl) as sysctls currently do
not allow checking the permissions at open time.- Writing to the proc file is restricted to before the gid_map
for the user namespace is set.This ensures that disabling setgroups at a user namespace
level will never remove the ability to call setgroups
from a process that already has that ability.A process may opt in to the setgroups disable for itself by
creating, entering and configuring a user namespace or by calling
setns on an existing user namespace with setgroups disabled.
Processes without privileges already can not call setgroups so this
is a noop. Prodcess with privilege become processes without
privilege when entering a user namespace and as with any other path
to dropping privilege they would not have the ability to call
setgroups. So this remains within the bounds of what is possible
without a knob to disable setgroups permanently in a user namespace.Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman"
10 Dec, 2014
1 commit
-
setgroups is unique in not needing a valid mapping before it can be called,
in the case of setgroups(0, NULL) which drops all supplemental groups.The design of the user namespace assumes that CAP_SETGID can not actually
be used until a gid mapping is established. Therefore add a helper function
to see if the user namespace gid mapping has been established and call
that function in the setgroups permission check.This is part of the fix for CVE-2014-8989, being able to drop groups
without privilege using user namespaces.Cc: stable@vger.kernel.org
Reviewed-by: Andy Lutomirski
Signed-off-by: "Eric W. Biederman"
05 Dec, 2014
1 commit
-
for now - just move corresponding ->proc_inum instances over there
Acked-by: "Eric W. Biederman"
Signed-off-by: Al Viro
09 Aug, 2014
1 commit
-
proc_uid_seq_operations, proc_gid_seq_operations and
proc_projid_seq_operations are only called in proc_id_map_open with
seq_open as const struct seq_operations so we can constify the 3
structures and update proc_id_map_open prototype.text data bss dec hex filename
6817 404 1984 9205 23f5 kernel/user_namespace.o-before
6913 308 1984 9205 23f5 kernel/user_namespace.o-afterSigned-off-by: Fabian Frederick
Cc: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Sep, 2013
1 commit
-
Add support for per-user_namespace registers of persistent per-UID kerberos
caches held within the kernel.This allows the kerberos cache to be retained beyond the life of all a user's
processes so that the user's cron jobs can work.The kerberos cache is envisioned as a keyring/key tree looking something like:
struct user_namespace
\___ .krb_cache keyring - The register
\___ _krb.0 keyring - Root's Kerberos cache
\___ _krb.5000 keyring - User 5000's Kerberos cache
\___ _krb.5001 keyring - User 5001's Kerberos cache
\___ tkt785 big_key - A ccache blob
\___ tkt12345 big_key - Another ccache blobOr possibly:
struct user_namespace
\___ .krb_cache keyring - The register
\___ _krb.0 keyring - Root's Kerberos cache
\___ _krb.5000 keyring - User 5000's Kerberos cache
\___ _krb.5001 keyring - User 5001's Kerberos cache
\___ tkt785 keyring - A ccache
\___ krbtgt/REDHAT.COM@REDHAT.COM big_key
\___ http/REDHAT.COM@REDHAT.COM user
\___ afs/REDHAT.COM@REDHAT.COM user
\___ nfs/REDHAT.COM@REDHAT.COM user
\___ krbtgt/KERNEL.ORG@KERNEL.ORG big_key
\___ http/KERNEL.ORG@KERNEL.ORG big_keyWhat goes into a particular Kerberos cache is entirely up to userspace. Kernel
support is limited to giving you the Kerberos cache keyring that you want.The user asks for their Kerberos cache by:
krb_cache = keyctl_get_krbcache(uid, dest_keyring);
The uid is -1 or the user's own UID for the user's own cache or the uid of some
other user's cache (requires CAP_SETUID). This permits rpc.gssd or whatever to
mess with the cache.The cache returned is a keyring named "_krb." that the possessor can read,
search, clear, invalidate, unlink from and add links to. Active LSMs get a
chance to rule on whether the caller is permitted to make a link.Each uid's cache keyring is created when it first accessed and is given a
timeout that is extended each time this function is called so that the keyring
goes away after a while. The timeout is configurable by sysctl but defaults to
three days.Each user_namespace struct gets a lazily-created keyring that serves as the
register. The cache keyrings are added to it. This means that standard key
search and garbage collection facilities are available.The user_namespace struct's register goes away when it does and anything left
in it is then automatically gc'd.Signed-off-by: David Howells
Tested-by: Simo Sorce
cc: Serge E. Hallyn
cc: Eric W. Biederman
08 Sep, 2013
1 commit
-
Pull namespace changes from Eric Biederman:
"This is an assorted mishmash of small cleanups, enhancements and bug
fixes.The major theme is user namespace mount restrictions. nsown_capable
is killed as it encourages not thinking about details that need to be
considered. A very hard to hit pid namespace exiting bug was finally
tracked and fixed. A couple of cleanups to the basic namespace
infrastructure.Finally there is an enhancement that makes per user namespace
capabilities usable as capabilities, and an enhancement that allows
the per userns root to nice other processes in the user namespace"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Kill nsown_capable it makes the wrong thing easy
capabilities: allow nice if we are privileged
pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
userns: Allow PR_CAPBSET_DROP in a user namespace.
namespaces: Simplify copy_namespaces so it is clear what is going on.
pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
sysfs: Restrict mounting sysfs
userns: Better restrictions on when proc and sysfs can be mounted
vfs: Don't copy mount bind mounts of /proc//ns/mnt between namespaces
kernel/nsproxy.c: Improving a snippet of code.
proc: Restrict mounting the proc filesystem
vfs: Lock in place mounts from more privileged users
27 Aug, 2013
1 commit
-
Rely on the fact that another flavor of the filesystem is already
mounted and do not rely on state in the user namespace.Verify that the mounted filesystem is not covered in any significant
way. I would love to verify that the previously mounted filesystem
has no mounts on top but there are at least the directories
/proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
for other filesystems to mount on top of.Refactor the test into a function named fs_fully_visible and call that
function from the mount routines of proc and sysfs. This makes this
test local to the filesystems involved and the results current of when
the mounts take place, removing a weird threading of the user
namespace, the mount namespace and the filesystems themselves.Signed-off-by: "Eric W. Biederman"
09 Aug, 2013
1 commit
-
Ensure that user_namespace->parent chain can't grow too much.
Currently we use the hardroded 32 as limit.Reported-by: Andy Lutomirski
Signed-off-by: Oleg Nesterov
Signed-off-by: Linus Torvalds
27 Mar, 2013
1 commit
-
Only allow unprivileged mounts of proc and sysfs if they are already
mounted when the user namespace is created.proc and sysfs are interesting because they have content that is
per namespace, and so fresh mounts are needed when new namespaces
are created while at the same time proc and sysfs have content that
is shared between every instance.Respect the policy of who may see the shared content of proc and sysfs
by only allowing new mounts if there was an existing mount at the time
the user namespace was created.In practice there are only two interesting cases: proc and sysfs are
mounted at their usual places, proc and sysfs are not mounted at all
(some form of mount namespace jail).Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman"
27 Jan, 2013
1 commit
-
When freeing a deeply nested user namespace free_user_ns calls
put_user_ns on it's parent which may in turn call free_user_ns again.
When -fno-optimize-sibling-calls is passed to gcc one stack frame per
user namespace is left on the stack, potentially overflowing the
kernel stack. CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
so we can't count on gcc to optimize this code.Remove struct kref and use a plain atomic_t. Making the code more
flexible and easier to comprehend. Make the loop in free_user_ns
explict to guarantee that the stack does not overflow with
CONFIG_FRAME_POINTER enabled.I have tested this fix with a simple program that uses unshare to
create a deeply nested user namespace structure and then calls exit.
With 1000 nesteuser namespaces before this change running my test
program causes the kernel to die a horrible death. With 10,000,000
nested user namespaces after this change my test program runs to
completion and causes no harm.Acked-by: Serge Hallyn
Pointed-out-by: Vasily Kulikov
Signed-off-by: "Eric W. Biederman"