07 Jun, 2014
1 commit
-
-uid->gid
-split some function declarations
-if/then/else warningSigned-off-by: Fabian Frederick
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Apr, 2014
1 commit
-
smp_read_barrier_depends() can be used if there is data dependency between
the readers - i.e. if the read operation after the barrier uses address
that was obtained from the read operation before the barrier.In this file, there is only control dependency, no data dependecy, so the
use of smp_read_barrier_depends() is incorrect. The code could fail in the
following way:
* the cpu predicts that idx < entries is true and starts executing the
body of the for loop
* the cpu fetches map->extent[0].first and map->extent[0].count
* the cpu fetches map->nr_extents
* the cpu verifies that idx < extents is true, so it commits the
instructions in the body of the for loopThe problem is that in this scenario, the cpu read map->extent[0].first
and map->nr_extents in the wrong order. We need a full read memory barrier
to prevent it.Signed-off-by: Mikulas Patocka
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds
04 Apr, 2014
1 commit
-
Code that is obj-y (always built-in) or dependent on a bool Kconfig
(built-in or absent) can never be modular. So using module_init as an
alias for __initcall can be somewhat misleading.Fix these up now, so that we can relocate module_init from init.h into
module.h in the future. If we don't do this, we'd have to add module.h
to obviously non-modular code, and that would be a worse thing.The audit targets the following module_init users for change:
kernel/user.c obj-y
kernel/kexec.c bool KEXEC (one instance per arch)
kernel/profile.c bool PROFILING
kernel/hung_task.c bool DETECT_HUNG_TASK
kernel/sched/stats.c bool SCHEDSTATS
kernel/user_namespace.c bool USER_NSNote that direct use of __initcall is discouraged, vs. one of the
priority categorized subgroups. As __initcall gets mapped onto
device_initcall, our use of subsys_initcall (which makes sense for these
files) will thus change this registration from level 6-device to level
4-subsys (i.e. slightly earlier). However no observable impact of that
difference has been observed during testing.Also, two instances of missing ";" at EOL are fixed in kexec.
Signed-off-by: Paul Gortmaker
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Eric Biederman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 Feb, 2014
1 commit
-
Signed-off-by: Brian Campbell
Signed-off-by: Linus Torvalds
24 Sep, 2013
1 commit
-
Add support for per-user_namespace registers of persistent per-UID kerberos
caches held within the kernel.This allows the kerberos cache to be retained beyond the life of all a user's
processes so that the user's cron jobs can work.The kerberos cache is envisioned as a keyring/key tree looking something like:
struct user_namespace
\___ .krb_cache keyring - The register
\___ _krb.0 keyring - Root's Kerberos cache
\___ _krb.5000 keyring - User 5000's Kerberos cache
\___ _krb.5001 keyring - User 5001's Kerberos cache
\___ tkt785 big_key - A ccache blob
\___ tkt12345 big_key - Another ccache blobOr possibly:
struct user_namespace
\___ .krb_cache keyring - The register
\___ _krb.0 keyring - Root's Kerberos cache
\___ _krb.5000 keyring - User 5000's Kerberos cache
\___ _krb.5001 keyring - User 5001's Kerberos cache
\___ tkt785 keyring - A ccache
\___ krbtgt/REDHAT.COM@REDHAT.COM big_key
\___ http/REDHAT.COM@REDHAT.COM user
\___ afs/REDHAT.COM@REDHAT.COM user
\___ nfs/REDHAT.COM@REDHAT.COM user
\___ krbtgt/KERNEL.ORG@KERNEL.ORG big_key
\___ http/KERNEL.ORG@KERNEL.ORG big_keyWhat goes into a particular Kerberos cache is entirely up to userspace. Kernel
support is limited to giving you the Kerberos cache keyring that you want.The user asks for their Kerberos cache by:
krb_cache = keyctl_get_krbcache(uid, dest_keyring);
The uid is -1 or the user's own UID for the user's own cache or the uid of some
other user's cache (requires CAP_SETUID). This permits rpc.gssd or whatever to
mess with the cache.The cache returned is a keyring named "_krb." that the possessor can read,
search, clear, invalidate, unlink from and add links to. Active LSMs get a
chance to rule on whether the caller is permitted to make a link.Each uid's cache keyring is created when it first accessed and is given a
timeout that is extended each time this function is called so that the keyring
goes away after a while. The timeout is configurable by sysctl but defaults to
three days.Each user_namespace struct gets a lazily-created keyring that serves as the
register. The cache keyrings are added to it. This means that standard key
search and garbage collection facilities are available.The user_namespace struct's register goes away when it does and anything left
in it is then automatically gc'd.Signed-off-by: David Howells
Tested-by: Simo Sorce
cc: Serge E. Hallyn
cc: Eric W. Biederman
08 Sep, 2013
1 commit
-
Pull namespace changes from Eric Biederman:
"This is an assorted mishmash of small cleanups, enhancements and bug
fixes.The major theme is user namespace mount restrictions. nsown_capable
is killed as it encourages not thinking about details that need to be
considered. A very hard to hit pid namespace exiting bug was finally
tracked and fixed. A couple of cleanups to the basic namespace
infrastructure.Finally there is an enhancement that makes per user namespace
capabilities usable as capabilities, and an enhancement that allows
the per userns root to nice other processes in the user namespace"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Kill nsown_capable it makes the wrong thing easy
capabilities: allow nice if we are privileged
pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
userns: Allow PR_CAPBSET_DROP in a user namespace.
namespaces: Simplify copy_namespaces so it is clear what is going on.
pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
sysfs: Restrict mounting sysfs
userns: Better restrictions on when proc and sysfs can be mounted
vfs: Don't copy mount bind mounts of /proc//ns/mnt between namespaces
kernel/nsproxy.c: Improving a snippet of code.
proc: Restrict mounting the proc filesystem
vfs: Lock in place mounts from more privileged users
27 Aug, 2013
1 commit
-
Rely on the fact that another flavor of the filesystem is already
mounted and do not rely on state in the user namespace.Verify that the mounted filesystem is not covered in any significant
way. I would love to verify that the previously mounted filesystem
has no mounts on top but there are at least the directories
/proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
for other filesystems to mount on top of.Refactor the test into a function named fs_fully_visible and call that
function from the mount routines of proc and sysfs. This makes this
test local to the filesystems involved and the results current of when
the mounts take place, removing a weird threading of the user
namespace, the mount namespace and the filesystems themselves.Signed-off-by: "Eric W. Biederman"
09 Aug, 2013
1 commit
-
Ensure that user_namespace->parent chain can't grow too much.
Currently we use the hardroded 32 as limit.Reported-by: Andy Lutomirski
Signed-off-by: Oleg Nesterov
Signed-off-by: Linus Torvalds
07 Aug, 2013
1 commit
-
unshare_userns(new_cred) does *new_cred = prepare_creds() before
create_user_ns() which can fail. However, the caller expects that
it doesn't need to take care of new_cred if unshare_userns() fails.We could change the single caller, sys_unshare(), but I think it
would be more clean to avoid the side effects on failure, so with
this patch unshare_userns() does put_cred() itself and initializes
*new_cred only if create_user_ns() succeeeds.Cc: stable@vger.kernel.org
Signed-off-by: Oleg Nesterov
Reviewed-by: Andy Lutomirski
Signed-off-by: Linus Torvalds
02 May, 2013
2 commits
-
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
... -
Split the proc namespace stuff out into linux/proc_ns.h.
Signed-off-by: David Howells
cc: netdev@vger.kernel.org
cc: Serge E. Hallyn
cc: Eric W. Biederman
Signed-off-by: Al Viro
15 Apr, 2013
3 commits
-
Changing uid/gid/projid mappings doesn't change your id within the
namespace; it reconfigures the namespace. Unprivileged programs should
*not* be able to write these files. (We're also checking the privileges
on the wrong task.)Given the write-once nature of these files and the other security
checks, this is likely impossible to usefully exploit.Signed-off-by: Andy Lutomirski
-
Signed-off-by: Andy Lutomirski
-
When we require privilege for setting /proc//uid_map or
/proc//gid_map no longer allow an unprivileged user to
open the file and pass it to a privileged program to write
to the file.Instead when privilege is required require both the opener and the
writer to have the necessary capabilities.I have tested this code and verified that setting /proc//uid_map
fails when an unprivileged user opens the file and a privielged user
attempts to set the mapping, that unprivileged users can still map
their own id, and that a privileged users can still setup an arbitrary
mapping.Reported-by: Andy Lutomirski
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Andy Lutomirski
27 Mar, 2013
2 commits
-
Only allow unprivileged mounts of proc and sysfs if they are already
mounted when the user namespace is created.proc and sysfs are interesting because they have content that is
per namespace, and so fresh mounts are needed when new namespaces
are created while at the same time proc and sysfs have content that
is shared between every instance.Respect the policy of who may see the shared content of proc and sysfs
by only allowing new mounts if there was an existing mount at the time
the user namespace was created.In practice there are only two interesting cases: proc and sysfs are
mounted at their usual places, proc and sysfs are not mounted at all
(some form of mount namespace jail).Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman" -
Guarantee that the policy of which files may be access that is
established by setting the root directory will not be violated
by user namespaces by verifying that the root directory points
to the root of the mount namespace at the time of user namespace
creation.Changing the root is a privileged operation, and as a matter of policy
it serves to limit unprivileged processes to files below the current
root directory.For reasons of simplicity and comprehensibility the privilege to
change the root directory is gated solely on the CAP_SYS_CHROOT
capability in the user namespace. Therefore when creating a user
namespace we must ensure that the policy of which files may be access
can not be violated by changing the root directory.Anyone who runs a processes in a chroot and would like to use user
namespace can setup the same view of filesystems with a mount
namespace instead. With this result that this is not a practical
limitation for using user namespaces.Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn
Reported-by: Andy Lutomirski
Signed-off-by: "Eric W. Biederman"
14 Mar, 2013
1 commit
-
Don't allowing sharing the root directory with processes in a
different user namespace. There doesn't seem to be any point, and to
allow it would require the overhead of putting a user namespace
reference in fs_struct (for permission checks) and incrementing that
reference count on practically every call to fork.So just perform the inexpensive test of forbidding sharing fs_struct
acrosss processes in different user namespaces. We already disallow
other forms of threading when unsharing a user namespace so this
should be no real burden in practice.This updates setns, clone, and unshare to disallow multiple user
namespaces sharing an fs_struct.Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Linus Torvalds
27 Jan, 2013
2 commits
-
When I initially wrote the code for /proc//uid_map. I was lazy
and avoided duplicate mappings by the simple expedient of ensuring the
first number in a new extent was greater than any number in the
previous extent.Unfortunately that precludes a number of valid mappings, and someone
noticed and complained. So use a simple check to ensure that ranges
in the mapping extents don't overlap.Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman" -
When freeing a deeply nested user namespace free_user_ns calls
put_user_ns on it's parent which may in turn call free_user_ns again.
When -fno-optimize-sibling-calls is passed to gcc one stack frame per
user namespace is left on the stack, potentially overflowing the
kernel stack. CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
so we can't count on gcc to optimize this code.Remove struct kref and use a plain atomic_t. Making the code more
flexible and easier to comprehend. Make the loop in free_user_ns
explict to guarantee that the stack does not overflow with
CONFIG_FRAME_POINTER enabled.I have tested this fix with a simple program that uses unshare to
create a deeply nested user namespace structure and then calls exit.
With 1000 nesteuser namespaces before this change running my test
program causes the kernel to die a horrible death. With 10,000,000
nested user namespaces after this change my test program runs to
completion and causes no harm.Acked-by: Serge Hallyn
Pointed-out-by: Vasily Kulikov
Signed-off-by: "Eric W. Biederman"
15 Dec, 2012
1 commit
-
Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman"
20 Nov, 2012
5 commits
-
Assign a unique proc inode to each namespace, and use that
inode number to ensure we only allocate at most one proc
inode for every namespace in proc.A single proc inode per namespace allows userspace to test
to see if two processes are in the same namespace.This has been a long requested feature and only blocked because
a naive implementation would put the id in a global space and
would ultimately require having a namespace for the names of
namespaces, making migration and certain virtualization tricks
impossible.We still don't have per superblock inode numbers for proc, which
appears necessary for application unaware checkpoint/restart and
migrations (if the application is using namespace file descriptors)
but that is now allowd by the design if it becomes important.I have preallocated the ipc and uts initial proc inode numbers so
their structures can be statically initialized.Signed-off-by: Eric W. Biederman
-
To keep things sane in the context of file descriptor passing derive the
user namespace that uids are mapped into from the opener of the file
instead of from current.When writing to the maps file the lower user namespace must always
be the parent user namespace, or setting the mapping simply does
not make sense. Enforce that the opener of the file was in
the parent user namespace or the user namespace whose mapping
is being set.Acked-by: Serge E. Hallyn
Signed-off-by: "Eric W. Biederman" -
- Add CLONE_THREAD to the unshare flags if CLONE_NEWUSER is selected
As changing user namespaces is only valid if all there is only
a single thread.
- Restore the code to add CLONE_VM if CLONE_THREAD is selected and
the code to addCLONE_SIGHAND if CLONE_VM is selected.
Making the constraints in the code clear.Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman" -
This allows entering a user namespace, and the ability
to store a reference to a user namespace with a bind
mount.Addition of missing userns_ns_put in userns_install
from Gao fengAcked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman" -
Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman"
18 Sep, 2012
1 commit
-
Implement kprojid_t a cousin of the kuid_t and kgid_t.
The per user namespace mapping of project id values can be set with
/proc//projid_map.A full compliment of helpers is provided: make_kprojid, from_kprojid,
from_kprojid_munged, kporjid_has_mapping, projid_valid, projid_eq,
projid_eq, projid_lt.Project identifiers are part of the generic disk quota interface,
although it appears only xfs implements project identifiers currently.The xfs code allows anyone who has permission to set the project
identifier on a file to use any project identifier so when
setting up the user namespace project identifier mappings I do
not require a capability.Cc: Dave Chinner
Cc: Jan Kara
Signed-off-by: "Eric W. Biederman"
03 May, 2012
1 commit
-
cred.h and a few trivial users of struct cred are changed. The rest of the users
of struct cred are left for other patches as there are too many changes to make
in one go and leave the change reviewable. If the user namespace is disabled and
CONFIG_UIDGID_STRICT_TYPE_CHECKS are disabled the code will contiue to compile
and behave correctly.Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman
26 Apr, 2012
2 commits
-
- Convert the old uid mapping functions into compatibility wrappers
- Add a uid/gid mapping layer from user space uid and gids to kernel
internal uids and gids that is extent based for simplicty and speed.
* Working with number space after mapping uids/gids into their kernel
internal version adds only mapping complexity over what we have today,
leaving the kernel code easy to understand and test.
- Add proc files /proc/self/uid_map /proc/self/gid_map
These files display the mapping and allow a mapping to be added
if a mapping does not exist.
- Allow entering the user namespace without a uid or gid mapping.
Since we are starting with an existing user our uids and gids
still have global mappings so are still valid and useful they just don't
have local mappings. The requirement for things to work are global uid
and gid so it is odd but perfectly fine not to have a local uid
and gid mapping.
Not requiring global uid and gid mappings greatly simplifies
the logic of setting up the uid and gid mappings by allowing
the mappings to be set after the namespace is created which makes the
slight weirdness worth it.
- Make the mappings in the initial user namespace to the global
uid/gid space explicit. Today it is an identity mapping
but in the future we may want to twist this for debugging, similar
to what we do with jiffies.
- Document the memory ordering requirements of setting the uid and
gid mappings. We only allow the mappings to be set once
and there are no pointers involved so the requirments are
trivial but a little atypical.Performance:
In this scheme for the permission checks the performance is expected to
stay the same as the actuall machine instructions should remain the same.The worst case I could think of is ls -l on a large directory where
all of the stat results need to be translated with from kuids and
kgids to uids and gids. So I benchmarked that case on my laptop
with a dual core hyperthread Intel i5-2520M cpu with 3M of cpu cache.My benchmark consisted of going to single user mode where nothing else
was running. On an ext4 filesystem opening 1,000,000 files and looping
through all of the files 1000 times and calling fstat on the
individuals files. This was to ensure I was benchmarking stat times
where the inodes were in the kernels cache, but the inode values were
not in the processors cache. My results:v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)All of the configurations ran in roughly 120ns when I performed tests
that ran in the cpu cache.So in summary the performance impact is:
1ns improvement in the worst case with user namespace support compiled out.
8ns aka 5% slowdown in the worst case with user namespace support compiled in.Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman -
- Transform userns->creator from a user_struct reference to a simple
kuid_t, kgid_t pair.In cap_capable this allows the check to see if we are the creator of
a namespace to become the classic suser style euid permission check.This allows us to remove the need for a struct cred in the mapping
functions and still be able to dispaly the user namespace creators
uid and gid as 0.- Remove the now unnecessary delayed_work in free_user_ns.
All that is left for free_user_ns to do is to call kmem_cache_free
and put_user_ns. Those functions can be called in any context
so call them directly from free_user_ns removing the need for delayed work.Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman
08 Apr, 2012
5 commits
-
Modify alloc_uid to take a kuid and make the user hash table global.
Stop holding a reference to the user namespace in struct user_struct.This simplifies the code and makes the per user accounting not
care about which user namespace a uid happens to appear in.Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman -
Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman -
I am about to remove the struct user_namespace reference from struct user_struct.
So keep an explicit track of the parent user namespace.Take advantage of this new reference and replace instances of user_ns->creator->user_ns
with user_ns->parent.Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman -
struct user_struct will shortly loose it's user_ns reference
so make the cred user_ns reference a proper reference complete
with reference counting.Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman -
Optimize performance and prepare for the removal of the user_ns reference
from user_struct. Remove the slow long walk through cred->user->user_ns and
instead go straight to cred->user_ns.Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman
31 Oct, 2011
1 commit
-
The changed files were only including linux/module.h for the
EXPORT_SYMBOL infrastructure, and nothing else. Revector them
onto the isolated export header for faster compile times.Nothing to see here but a whole lot of instances of:
-#include
+#includeThis commit is only changing the kernel dir; next targets
will probably be mm, fs, the arch dirs, etc.Signed-off-by: Paul Gortmaker
14 Jan, 2011
1 commit
-
Currently on 64-bit arch the user_namespace is 2096 and when being
kmalloc-ed it resides on a 4k slab wasting 2003 bytes.If we allocate a separate cache for it and reduce the hash size from 128
to 64 chains the packaging becomes *much* better - the struct is 1072
bytes and the hole between is 98 bytes.[akpm@linux-foundation.org: s/__initcall/module_init/]
Signed-off-by: Pavel Emelyanov
Acked-by: Serge E. Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds