Eric Lee / smarc-fsl-linux-kernel

17 Dec, 2014

1 commit

603ba7e41 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs pile #2 from Al Viro:
"Next pile (and there'll be one or two more).

The large piece in this one is getting rid of /proc/*/ns/* weirdness;
among other things, it allows to (finally) make nameidata completely
opaque outside of fs/namei.c, making for easier further cleanups in
there"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
coda_venus_readdir(): use file_inode()
fs/namei.c: fold link_path_walk() call into path_init()
path_init(): don't bother with LOOKUP_PARENT in argument
fs/namei.c: new helper (path_cleanup())
path_init(): store the "base" pointer to file in nameidata itself
make default ->i_fop have ->open() fail with ENXIO
make nameidata completely opaque outside of fs/namei.c
kill proc_ns completely
take the targets of /proc/*/ns/* symlinks to separate fs
bury struct proc_ns in fs/proc
copy address of proc_ns_ops into ns_common
new helpers: ns_alloc_inum/ns_free_inum
make proc_ns_operations work with struct ns_common * instead of void *
switch the rest of proc_ns_operations to working with &...->ns
netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
common object embedded into various struct ....ns

Linus Torvalds
2014-12-17 07:53:03 +0800

11 Dec, 2014

1 commit

a53b83154 exit: pidns: fix/update the comments in zap_pid_ns_processes() ... Browse Code »

The comments in zap_pid_ns_processes() are not clear, we need to explain
how this code actually works.

1. "Ignore SIGCHLD" looks like optimization but it is not, we also
need this for correctness.

2. The comment above sys_wait4() could tell more.

EXIT_ZOMBIE child is only possible if it has exited before we
ignored SIGCHLD. Or if it is traced from the parent namespace,
but in this case it will be reaped by debugger after detach,
sys_wait4() acts as a synchronization point.

3. The comment about TASK_DEAD (EXIT_DEAD in fact) children is
outdated. Contrary to what it says we do not need to make sure
they all go away after 0a01f2cc390e "pidns: Make the pidns proc
mount/umount logic obvious".

At the same time, we do need to wait for nr_hashed==init_pids,
but the reasons are quite different and not obvious: setns().

Signed-off-by: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Aaron Tomlin
Cc: Pavel Emelyanov
Cc: Serge Hallyn
Cc: Sterling Alexander
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-12-11 09:41:18 +0800

05 Dec, 2014

5 commits

33c429405 copy address of proc_ns_ops into ns_common ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2014-12-05 03:34:47 +0800
6344c433a new helpers: ns_alloc_inum/ns_free_inum ... Browse Code »

take struct ns_common *, for now simply wrappers around proc_{alloc,free}_inum()

Signed-off-by: Al Viro

Al Viro
2014-12-05 03:34:36 +0800
64964528b make proc_ns_operations work with struct ns_common * instead of void * ... Browse Code »

We can do that now. And kill ->inum(), while we are at it - all instances
are identical.

Signed-off-by: Al Viro

Al Viro
2014-12-05 03:34:17 +0800
3c0411846 switch the rest of proc_ns_operations to working with &...->ns ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2014-12-05 03:34:11 +0800
435d5f4bb common object embedded into various struct ....ns ... Browse Code »

for now - just move corresponding ->proc_inum instances over there

Acked-by: "Eric W. Biederman"
Signed-off-by: Al Viro

Al Viro
2014-12-05 03:31:00 +0800

03 Apr, 2014

1 commit

d23082257 pid_namespace: pidns_get() should check task_active_pid_ns() != NULL ... Browse Code »

pidns_get()->get_pid_ns() can hit ns == NULL. This task_struct can't
go away, but task_active_pid_ns(task) is NULL if release_task(task)
was already called. Alternatively we could change get_pid_ns(ns) to
check ns != NULL, but it seems that other callers are fine.

Signed-off-by: Oleg Nesterov
Cc: Eric W. Biederman ebiederm@xmission.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-04-03 07:20:21 +0800

25 Oct, 2013

1 commit

1adfcb03e pid_namespace: make freeing struct pid_namespace rcu-delayed ... Browse Code »

makes procfs ->premission() instances safety in RCU mode independent
from vfsmount_lock.

Signed-off-by: Al Viro

Al Viro
2013-10-25 11:43:29 +0800

08 Sep, 2013

1 commit

c7c4591db Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull namespace changes from Eric Biederman:
"This is an assorted mishmash of small cleanups, enhancements and bug
fixes.

The major theme is user namespace mount restrictions. nsown_capable
is killed as it encourages not thinking about details that need to be
considered. A very hard to hit pid namespace exiting bug was finally
tracked and fixed. A couple of cleanups to the basic namespace
infrastructure.

Finally there is an enhancement that makes per user namespace
capabilities usable as capabilities, and an enhancement that allows
the per userns root to nice other processes in the user namespace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Kill nsown_capable it makes the wrong thing easy
capabilities: allow nice if we are privileged
pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
userns: Allow PR_CAPBSET_DROP in a user namespace.
namespaces: Simplify copy_namespaces so it is clear what is going on.
pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
sysfs: Restrict mounting sysfs
userns: Better restrictions on when proc and sysfs can be mounted
vfs: Don't copy mount bind mounts of /proc//ns/mnt between namespaces
kernel/nsproxy.c: Improving a snippet of code.
proc: Restrict mounting the proc filesystem
vfs: Lock in place mounts from more privileged users

Linus Torvalds
2013-09-08 05:35:32 +0800

31 Aug, 2013

1 commit

c7b96acf1 userns: Kill nsown_capable it makes the wrong thing easy ... Browse Code »

nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and
CAP_SETGID. For the existing users it doesn't noticably simplify things and
from the suggested patches I have seen it encourages people to do the wrong
thing. So remove nsown_capable.

Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-08-31 14:44:11 +0800

28 Aug, 2013

1 commit

c2b1df2eb Rename nsproxy.pid_ns to nsproxy.pid_ns_for_children ... Browse Code »

nsproxy.pid_ns is *not* the task's pid namespace. The name should clarify
that.

This makes it more obvious that setns on a pid namespace is weird --
it won't change the pid namespace shown in procfs.

Signed-off-by: Andy Lutomirski
Reviewed-by: "Eric W. Biederman"
Signed-off-by: David S. Miller

Andy Lutomirski
2013-08-28 01:52:52 +0800

02 May, 2013

2 commits

20b4fb485 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...

Linus Torvalds
2013-05-02 08:51:54 +0800
0bb80f240 proc: Split the namespace stuff out into linux/proc_ns.h ... Browse Code »

Split the proc namespace stuff out into linux/proc_ns.h.

Signed-off-by: David Howells
cc: netdev@vger.kernel.org
cc: Serge E. Hallyn
cc: Eric W. Biederman
Signed-off-by: Al Viro

David Howells
2013-05-02 05:29:39 +0800

01 May, 2013

1 commit

5cc544516 pid_namespace.c/.h: simplify defines ... Browse Code »

Move BITS_PER_PAGE from pid_namespace.c to pid_namespace.h, since we can
simplify the define PID_MAP_ENTRIES by using the BITS_PER_PAGE.

[akpm@linux-foundation.org: kernel/pid.c:54:1: warning: "BITS_PER_PAGE" redefined]
Signed-off-by: Raphael S.Carvalho
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Raphael S.Carvalho
2013-05-01 08:04:07 +0800

26 Mar, 2013

1 commit

751c644b9 pid: Handle the exit of a multi-threaded init. ... Browse Code »

When a multi-threaded init exits and the initial thread is not the
last thread to exit the initial thread hangs around as a zombie
until the last thread exits. In that case zap_pid_ns_processes
needs to wait until there are only 2 hashed pids in the pid
namespace not one.

v2. Replace thread_pid_vnr(me) == 1 with the test thread_group_leader(me)
as suggested by Oleg.

Cc: stable@vger.kernel.org
Cc: Oleg Nesterov
Reported-by: Caj Larsson
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-03-26 18:41:23 +0800

26 Dec, 2012

1 commit

c876ad768 pidns: Stop pid allocation when init dies ... Browse Code »

Oleg pointed out that in a pid namespace the sequence.
- pid 1 becomes a zombie
- setns(thepidns), fork,...
- reaping pid 1.
- The injected processes exiting.

Can lead to processes attempting access their child reaper and
instead following a stale pointer.

That waitpid for init can return before all of the processes in
the pid namespace have exited is also unfortunate.

Avoid these problems by disabling the allocation of new pids in a pid
namespace when init dies, instead of when the last process in a pid
namespace is reaped.

Pointed-out-by: Oleg Nesterov
Reviewed-by: Oleg Nesterov
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2012-12-26 08:10:05 +0800

15 Dec, 2012

1 commit

5e4a08476 userns: Require CAP_SYS_ADMIN for most uses of setns. ... Browse Code »

Andy Lutomirski found a nasty little bug in
the permissions of setns. With unprivileged user namespaces it
became possible to create new namespaces without privilege.

However the setns calls were relaxed to only require CAP_SYS_ADMIN in
the user nameapce of the targed namespace.

Which made the following nasty sequence possible.

pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
if (pid == 0) { /* child */
system("mount --bind /home/me/passwd /etc/passwd");
}
else if (pid != 0) { /* parent */
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
fd = open(path, O_RDONLY);
setns(fd, 0);
system("su -");
}

Prevent this possibility by requiring CAP_SYS_ADMIN
in the current user namespace when joing all but the user namespace.

Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2012-12-15 08:12:03 +0800

20 Nov, 2012

1 commit

98f842e67 proc: Usable inode numbers for the namespace file descriptors. ... Browse Code »

Assign a unique proc inode to each namespace, and use that
inode number to ensure we only allocate at most one proc
inode for every namespace in proc.

A single proc inode per namespace allows userspace to test
to see if two processes are in the same namespace.

This has been a long requested feature and only blocked because
a naive implementation would put the id in a global space and
would ultimately require having a namespace for the names of
namespaces, making migration and certain virtualization tricks
impossible.

We still don't have per superblock inode numbers for proc, which
appears necessary for application unaware checkpoint/restart and
migrations (if the application is using namespace file descriptors)
but that is now allowd by the design if it becomes important.

I have preallocated the ipc and uts initial proc inode numbers so
their structures can be statically initialized.

Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-11-20 20:19:49 +0800

19 Nov, 2012

6 commits

50804fe37 pidns: Support unsharing the pid namespace. ... Browse Code »

Unsharing of the pid namespace unlike unsharing of other namespaces
does not take affect immediately. Instead it affects the children
created with fork and clone. The first of these children becomes the init
process of the new pid namespace, the rest become oddball children
of pid 0. From the point of view of the new pid namespace the process
that created it is pid 0, as it's pid does not map.

A couple of different semantics were considered but this one was
settled on because it is easy to implement and it is usable from
pam modules. The core reasons for the existence of unshare.

I took a survey of the callers of pam modules and the following
appears to be a representative sample of their logic.
{
setup stuff include pam
child = fork();
if (!child) {
setuid()
exec /bin/bash
}
waitpid(child);

pam and other cleanup
}

As you can see there is a fork to create the unprivileged user
space process. Which means that the unprivileged user space
process will appear as pid 1 in the new pid namespace. Further
most login processes do not cope with extraneous children which
means shifting the duty of reaping extraneous child process to
the creator of those extraneous children makes the system more
comprehensible.

The practical reason for this set of pid namespace semantics is
that it is simple to implement and verify they work correctly.
Whereas an implementation that requres changing the struct
pid on a process comes with a lot more races and pain. Not
the least of which is that glibc caches getpid().

These semantics are implemented by having two notions
of the pid namespace of a proces. There is task_active_pid_ns
which is the pid namspace the process was created with
and the pid namespace that all pids are presented to
that process in. The task_active_pid_ns is stored
in the struct pid of the task.

Then there is the pid namespace that will be used for children
that pid namespace is stored in task->nsproxy->pid_ns.

Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-11-19 21:59:16 +0800
57e8391d3 pidns: Add setns support ... Browse Code »

- Pid namespaces are designed to be inescapable so verify that the
passed in pid namespace is a child of the currently active
pid namespace or the currently active pid namespace itself.

Allowing the currently active pid namespace is important so
the effects of an earlier setns can be cancelled.

Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-11-19 21:59:14 +0800
225778d68 pidns: Deny strange cases when creating pid namespaces. ... Browse Code »

task_active_pid_ns(current) != current->ns_proxy->pid_ns will
soon be allowed to support unshare and setns.

The definition of creating a child pid namespace when
task_active_pid_ns(current) != current->ns_proxy->pid_ns could be that
we create a child pid namespace of current->ns_proxy->pid_ns. However
that leads to strange cases like trying to have a single process be
init in multiple pid namespaces, which is racy and hard to think
about.

The definition of creating a child pid namespace when
task_active_pid_ns(current) != current->ns_proxy->pid_ns could be that
we create a child pid namespace of task_active_pid_ns(current). While
that seems less racy it does not provide any utility.

Therefore define the semantics of creating a child pid namespace when
task_active_pid_ns(current) != current->ns_proxy->pid_ns to be that the
pid namespace creation fails. That is easy to implement and easy
to think about.

Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2012-11-19 21:59:13 +0800
af4b8a83a pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 ... Browse Code »

Looking at pid_ns->nr_hashed is a bit simpler and it works for
disjoint process trees that an unshare or a join of a pid_namespace
may create.

Acked-by: "Serge E. Hallyn"
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2012-11-19 21:59:12 +0800
0a01f2cc3 pidns: Make the pidns proc mount/umount logic obvious. ... Browse Code »

Track the number of pids in the proc hash table. When the number of
pids goes to 0 schedule work to unmount the kernel mount of proc.

Move the mount of proc into alloc_pid when we allocate the pid for
init.

Remove the surprising calls of pid_ns_release proc in fork and
proc_flush_task. Those code paths really shouldn't know about proc
namespace implementation details and people have demonstrated several
times that finding and understanding those code paths is difficult and
non-obvious.

Because of the call path detach pid is alwasy called with the
rtnl_lock held free_pid is not allowed to sleep, so the work to
unmounting proc is moved to a work queue. This has the side benefit
of not blocking the entire world waiting for the unnecessary
rcu_barrier in deactivate_locked_super.

In the process of making the code clear and obvious this fixes a bug
reported by Gao feng where we would leak a
mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
succeeded and copy_net_ns failed.

Acked-by: "Serge E. Hallyn"
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2012-11-19 21:59:10 +0800
49f4d8b93 pidns: Capture the user namespace and filter ns_last_pid ... Browse Code »

- Capture the the user namespace that creates the pid namespace
- Use that user namespace to test if it is ok to write to
/proc/sys/kernel/ns_last_pid.

Zhao Hongjiang noticed I was missing a put_user_ns
in when destroying a pid_ns. I have foloded his patch into this one
so that bisects will work properly.

Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2012-11-19 21:57:31 +0800

26 Oct, 2012

1 commit

f23025057 pidns: limit the nesting depth of pid namespaces ... Browse Code »

'struct pid' is a "variable sized struct" - a header with an array of
upids at the end.

The size of the array depends on a level (depth) of pid namespaces. Now a
level of pidns is not limited, so 'struct pid' can be more than one page.

Looks reasonable, that it should be less than a page. MAX_PIS_NS_LEVEL is
not calculated from PAGE_SIZE, because in this case it depends on
architectures, config options and it will be reduced, if someone adds a
new fields in struct pid or struct upid.

I suggest to set MAX_PIS_NS_LEVEL = 32, because it saves ability to expand
"struct pid" and it's more than enough for all known for me use-cases.
When someone finds a reasonable use case, we can add a config option or a
sysctl parameter.

In addition it will reduce the effect of another problem, when we have
many nested namespaces and the oldest one starts dying.
zap_pid_ns_processe will be called for each namespace and find_vpid will
be called for each process in a namespace. find_vpid will be called
minimum max_level^2 / 2 times. The reason of that is that when we found a
bit in pidmap, we can't determine this pidns is top for this process or it
isn't.

vpid is a heavy operation, so a fork bomb, which create many nested
namespace, can make a system inaccessible for a long time. For example my
system becomes inaccessible for a few minutes with 4000 processes.

[akpm@linux-foundation.org: return -EINVAL in response to excessive nesting, not -ENOMEM]
Signed-off-by: Andrew Vagin
Acked-by: Oleg Nesterov
Cc: Cyrill Gorcunov
Cc: "Eric W. Biederman"
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Vagin
2012-10-26 05:37:53 +0800

20 Oct, 2012

1 commit

bbc2e3ef8 pidns: remove recursion from free_pid_ns() ... Browse Code »

free_pid_ns() operates in a recursive fashion:

free_pid_ns(parent)
put_pid_ns(parent)
kref_put(&ns->kref, free_pid_ns);
free_pid_ns

thus if there was a huge nesting of namespaces the userspace may trigger
avalanche calling of free_pid_ns leading to kernel stack exhausting and a
panic eventually.

This patch turns the recursion into an iterative loop.

Based on a patch by Andrew Vagin.

[akpm@linux-foundation.org: export put_pid_ns() to modules]
Signed-off-by: Cyrill Gorcunov
Cc: Andrew Vagin
Cc: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Pavel Emelyanov
Cc: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-10-20 05:07:47 +0800

03 Oct, 2012

1 commit

437589a74 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull user namespace changes from Eric Biederman:
"This is a mostly modest set of changes to enable basic user namespace
support. This allows the code to code to compile with user namespaces
enabled and removes the assumption there is only the initial user
namespace. Everything is converted except for the most complex of the
filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
nfs, ocfs2 and xfs as those patches need a bit more review.

The strategy is to push kuid_t and kgid_t values are far down into
subsystems and filesystems as reasonable. Leaving the make_kuid and
from_kuid operations to happen at the edge of userspace, as the values
come off the disk, and as the values come in from the network.
Letting compile type incompatible compile errors (present when user
namespaces are enabled) guide me to find the issues.

The most tricky areas have been the places where we had an implicit
union of uid and gid values and were storing them in an unsigned int.
Those places were converted into explicit unions. I made certain to
handle those places with simple trivial patches.

Out of that work I discovered we have generic interfaces for storing
quota by projid. I had never heard of the project identifiers before.
Adding full user namespace support for project identifiers accounts
for most of the code size growth in my git tree.

Ultimately there will be work to relax privlige checks from
"capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
root in a user names to do those things that today we only forbid to
non-root users because it will confuse suid root applications.

While I was pushing kuid_t and kgid_t changes deep into the audit code
I made a few other cleanups. I capitalized on the fact we process
netlink messages in the context of the message sender. I removed
usage of NETLINK_CRED, and started directly using current->tty.

Some of these patches have also made it into maintainer trees, with no
problems from identical code from different trees showing up in
linux-next.

After reading through all of this code I feel like I might be able to
win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
userns: Convert the ufs filesystem to use kuid/kgid where appropriate
userns: Convert the udf filesystem to use kuid/kgid where appropriate
userns: Convert ubifs to use kuid/kgid
userns: Convert squashfs to use kuid/kgid where appropriate
userns: Convert reiserfs to use kuid and kgid where appropriate
userns: Convert jfs to use kuid/kgid where appropriate
userns: Convert jffs2 to use kuid and kgid where appropriate
userns: Convert hpfs to use kuid and kgid where appropriate
userns: Convert btrfs to use kuid/kgid where appropriate
userns: Convert bfs to use kuid/kgid where appropriate
userns: Convert affs to use kuid/kgid wherwe appropriate
userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
userns: On ia64 deal with current_uid and current_gid being kuid and kgid
userns: On ppc convert current_uid from a kuid before printing.
userns: Convert s390 getting uid and gid system calls to use kuid and kgid
userns: Convert s390 hypfs to use kuid and kgid where appropriate
userns: Convert binder ipc to use kuids
userns: Teach security_path_chown to take kuids and kgids
userns: Add user namespace support to IMA
userns: Convert EVM to deal with kuids and kgids in it's hmac computation
...

Linus Torvalds
2012-10-03 02:11:09 +0800

18 Sep, 2012

1 commit

579035dc5 pid-namespace: limit value of ns_last_pid to (0, max_pid) ... Browse Code »

The kernel doesn't check the pid for negative values, so if you try to
write -2 to /proc/sys/kernel/ns_last_pid, you will get a kernel panic.

The crash happens because the next pid is -1, and alloc_pidmap() will
try to access to a nonexistent pidmap.

map = &pid_ns->pidmap[pid/BITS_PER_PAGE];

Signed-off-by: Andrew Vagin
Acked-by: Cyrill Gorcunov
Acked-by: Oleg Nesterov
Cc: Eric W. Biederman
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Vagin
2012-09-18 06:00:38 +0800

15 Aug, 2012

1 commit

523a6a945 pidns: Export free_pid_ns ... Browse Code »

There is a least one modular user so export free_pid_ns so modules can
capture and use the pid namespace on the very rare occasion when it
makes sense.

Acked-by: David S. Miller
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2012-08-15 12:49:35 +0800

21 Jun, 2012

1 commit

6347e9009 pidns: guarantee that the pidns init will be the last pidns process reaped ... Browse Code »

Today we have a twofold bug. Sometimes release_task on pid == 1 in a pid
namespace can run before other processes in a pid namespace have had
release task called. With the result that pid_ns_release_proc can be
called before the last proc_flus_task() is done using upid->ns->proc_mnt,
resulting in the use of a stale pointer. This same set of circumstances
can lead to waitpid(...) returning for a processes started with
clone(CLONE_NEWPID) before the every process in the pid namespace has
actually exited.

To fix this modify zap_pid_ns_processess wait until all other processes in
the pid namespace have exited, even EXIT_DEAD zombies.

The delay_group_leader and related tests ensure that the thread gruop
leader will be the last thread of a process group to be reaped, or to
become EXIT_DEAD and self reap. With the change to zap_pid_ns_processes
we get the guarantee that pid == 1 in a pid namespace will be the last
task that release_task is called on.

With pid == 1 being the last task to pass through release_task
pid_ns_release_proc can no longer be called too early nor can wait return
before all of the EXIT_DEAD tasks in a pid namespace have exited.

Signed-off-by: Eric W. Biederman
Signed-off-by: Oleg Nesterov
Cc: Louis Rilling
Cc: Mike Galbraith
Acked-by: Pavel Emelyanov
Tested-by: Andrew Wagin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2012-06-21 05:39:36 +0800

01 Jun, 2012

2 commits

98ed57eef sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE ... Browse Code »

For those who doesn't need C/R functionality there is no need to control
last pid, ie the pid for the next fork() call.

Signed-off-by: Cyrill Gorcunov
Cc: Pavel Emelyanov
Cc: Tejun Heo
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-01 08:49:32 +0800
00c10bc13 pidns: make killed children autoreap ... Browse Code »

Force SIGCHLD handling to SIG_IGN so that signals are not generated and so
that the children autoreap. This increases the parallelize and in general
the speed of network namespace shutdown.

Note self reaping childrean can exist past zap_pid_ns_processess but they
will all be reaped before we allow the pid namespace init task with pid ==
1 to be reaped.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Eric W. Biederman
Cc: Oleg Nesterov
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Louis Rilling
Cc: Mike Galbraith
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2012-06-01 08:49:32 +0800

29 Mar, 2012

1 commit

cf3f89214 pidns: add reboot_pid_ns() to handle the reboot syscall ... Browse Code »

In the case of a child pid namespace, rebooting the system does not really
makes sense. When the pid namespace is used in conjunction with the other
namespaces in order to create a linux container, the reboot syscall leads
to some problems.

A container can reboot the host. That can be fixed by dropping the
sys_reboot capability but we are unable to correctly to poweroff/
halt/reboot a container and the container stays stuck at the shutdown time
with the container's init process waiting indefinitively.

After several attempts, no solution from userspace was found to reliabily
handle the shutdown from a container.

This patch propose to make the init process of the child pid namespace to
exit with a signal status set to : SIGINT if the child pid namespace
called "halt/poweroff" and SIGHUP if the child pid namespace called
"reboot". When the reboot syscall is called and we are not in the initial
pid namespace, we kill the pid namespace for "HALT", "POWEROFF",
"RESTART", and "RESTART2". Otherwise we return EINVAL.

Returning EINVAL is also an easy way to check if this feature is supported
by the kernel when invoking another 'reboot' option like CAD.

By this way the parent process of the child pid namespace knows if it
rebooted or not and can take the right decision.

Test case:
==========

#include
#include
#include
#include
#include
#include
#include
#include

#include

static int do_reboot(void *arg)
{
int *cmd = arg;

if (reboot(*cmd))
printf("failed to reboot(%d): %m\n", *cmd);
}

int test_reboot(int cmd, int sig)
{
long stack_size = 4096;
void *stack = alloca(stack_size) + stack_size;
int status;
pid_t ret;

ret = clone(do_reboot, stack, CLONE_NEWPID | SIGCHLD, &cmd);
if (ret < 0) {
printf("failed to clone: %m\n");
return -1;
}

if (wait(&status) < 0) {
printf("unexpected wait error: %m\n");
return -1;
}

if (!WIFSIGNALED(status)) {
printf("child process exited but was not signaled\n");
return -1;
}

if (WTERMSIG(status) != sig) {
printf("signal termination is not the one expected\n");
return -1;
}

return 0;
}

int main(int argc, char *argv[])
{
int status;

status = test_reboot(LINUX_REBOOT_CMD_RESTART, SIGHUP);
if (status < 0)
return 1;
printf("reboot(LINUX_REBOOT_CMD_RESTART) succeed\n");

status = test_reboot(LINUX_REBOOT_CMD_RESTART2, SIGHUP);
if (status < 0)
return 1;
printf("reboot(LINUX_REBOOT_CMD_RESTART2) succeed\n");

status = test_reboot(LINUX_REBOOT_CMD_HALT, SIGINT);
if (status < 0)
return 1;
printf("reboot(LINUX_REBOOT_CMD_HALT) succeed\n");

status = test_reboot(LINUX_REBOOT_CMD_POWER_OFF, SIGINT);
if (status < 0)
return 1;
printf("reboot(LINUX_REBOOT_CMD_POWERR_OFF) succeed\n");

status = test_reboot(LINUX_REBOOT_CMD_CAD_ON, -1);
if (status >= 0) {
printf("reboot(LINUX_REBOOT_CMD_CAD_ON) should have failed\n");
return 1;
}
printf("reboot(LINUX_REBOOT_CMD_CAD_ON) has failed as expected\n");

return 0;
}

[akpm@linux-foundation.org: tweak and add comments]
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Daniel Lezcano
Acked-by: Serge Hallyn
Tested-by: Serge Hallyn
Reviewed-by: Oleg Nesterov
Cc: Michael Kerrisk
Cc: "Eric W. Biederman"
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daniel Lezcano
2012-03-29 08:14:36 +0800

24 Mar, 2012

1 commit

a02d6fd64 signal: zap_pid_ns_processes: s/SEND_SIG_NOINFO/SEND_SIG_FORCED/ ... Browse Code »

Change zap_pid_ns_processes() to use SEND_SIG_FORCED, it looks more
clear compared to SEND_SIG_NOINFO which relies on from_ancestor_ns logic
send_signal().

It is also more efficient if we need to kill a lot of tasks because it
doesn't alloc sigqueue.

While at it, add the __fatal_signal_pending(task) check as a minor
optimization.

Signed-off-by: Oleg Nesterov
Cc: Tejun Heo
Cc: Anton Vorontsov
Cc: "Eric W. Biederman"
Cc: KOSAKI Motohiro
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2012-03-24 07:58:41 +0800

13 Jan, 2012

1 commit

b8f566b04 sysctl: add the kernel.ns_last_pid control ... Browse Code »

The sysctl works on the current task's pid namespace, getting and setting
its last_pid field.

Writing is allowed for CAP_SYS_ADMIN-capable tasks thus making it possible
to create a task with desired pid value. This ability is required badly
for the checkpoint/restore in userspace.

This approach suits all the parties for now.

Signed-off-by: Pavel Emelyanov
Acked-by: Tejun Heo
Cc: Oleg Nesterov
Cc: Cyrill Gorcunov
Cc: "Eric W. Biederman"
Cc: Serge Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2012-01-13 12:13:11 +0800

24 Mar, 2011

1 commit

4308eebbe pidns: call pid_ns_prepare_proc() from create_pid_namespace() ... Browse Code »

Reorganize proc_get_sb() so it can be called before the struct pid of the
first process is allocated.

Signed-off-by: Eric W. Biederman
Signed-off-by: Daniel Lezcano
Cc: Oleg Nesterov
Cc: Alexey Dobriyan
Acked-by: Serge E. Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2011-03-24 10:46:58 +0800

30 Mar, 2010

1 commit

5a0e3ad6a include cleanup: Update gfp.h and slab.h includes to prepare for breaking implic… ... Browse Code »

…it slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Tejun Heo
2010-03-30 21:02:32 +0800

13 Mar, 2010

1 commit

13aa9a6b0 pid_ns: zap_pid_ns_processes: use SEND_SIG_NOINFO instead of force_sig() ... Browse Code »

zap_pid_ns_processes() uses force_sig(SIGKILL) to ensure SIGKILL will be
delivered to sub-namespace inits as well. This is correct, but we are
going to change force_sig_info() semantics. See
http://bugzilla.kernel.org/show_bug.cgi?id=15395#c31

We can use send_sig_info(SEND_SIG_NOINFO) instead, since
614c517d7c00af1b26ded20646b329397d6f51a1 ("signals: SEND_SIG_NOINFO should
be considered as SI_FROMUSER()") SEND_SIG_NOINFO means "from user" and
therefore send_signal() will get the correct from_ancestor_ns = T flag.

Signed-off-by: Oleg Nesterov
Acked-by: Serge Hallyn
Acked-by: Linus Torvalds
Acked-by: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-03-13 07:52:40 +0800

24 Sep, 2009

1 commit

e5a473869 pidns: deny CLONE_PARENT|CLONE_NEWPID combination ... Browse Code »

CLONE_PARENT was used to implement an older threading model. For
consistency with the CLONE_THREAD check in copy_pid_ns(), disable
CLONE_PARENT with CLONE_NEWPID, at least until the required semantics of
pid namespaces are clear.

Signed-off-by: Sukadev Bhattiprolu
Acked-by: Roland McGrath
Acked-by: Serge Hallyn
Cc: Oren Laadan
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sukadev Bhattiprolu
2009-09-24 22:21:04 +0800