Doug / smarc-fsl-linux-kernel | Embedian Git Server

30 Nov, 2007

2 commits

e6ceb32aa wait_task_stopped(): pass correct exit_code to wait_noreap_copyout() ... Browse Code »

In wait_task_stopped() exit_code already contains the right value for the
si_status member of siginfo, and this is simply set in the non WNOWAIT
case.

If you call waitid() with a stopped or traced process, you'll get the signal
in siginfo.si_status as expected -- however if you call waitid(WNOWAIT) at the
same time, you'll get the signal << 8 | 0x7f

Pass it unchanged to wait_noreap_copyout(); we would only need to shift it
and add 0x7f if we were returning it in the user status field and that
isn't used for any function that permits WNOWAIT.

Signed-off-by: Scott James Remnant
Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Scott James Remnant
2007-11-30 01:24:55 +0800
c89507835 wait_task_stopped(): don't use task_pid_nr_ns() lockless ... Browse Code »

wait_task_stopped(WNOWAIT) does task_pid_nr_ns() without tasklist/rcu lock,
we can read an already freed memory. Use the cached pid_t value.

Signed-off-by: Oleg Nesterov
Looks-good-to: Roland McGrath
Acked-by: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2007-11-30 01:24:52 +0800

16 Nov, 2007

1 commit

a3474224e wait_task_stopped: Check p->exit_state instead of TASK_TRACED ... Browse Code »

The original meaning of the old test (p->state > TASK_STOPPED) was
"not dead", since it was before TASK_TRACED existed and before the
state/exit_state split. It was a wrong correction in commit
14bf01bb0599c89fc7f426d20353b76e12555308 to make this test for
TASK_TRACED instead. It should have been changed when TASK_TRACED
was introducted and again when exit_state was introduced.

Signed-off-by: Roland McGrath
Cc: Oleg Nesterov
Cc: Alexey Dobriyan
Cc: Kees Cook
Acked-by: Scott James Remnant
Signed-off-by: Linus Torvalds

Roland McGrath
2007-11-16 00:36:27 +0800

20 Oct, 2007

16 commits

a39bc5169 Uninline fork.c/exit.c ... Browse Code »

Save ~650 bytes here.

add/remove: 4/0 grow/shrink: 0/7 up/down: 430/-1088 (-658)
function old new delta
__copy_fs_struct - 202 +202
__put_fs_struct - 112 +112
__exit_fs - 58 +58
__exit_files - 58 +58
exit_files 58 2 -56
put_fs_struct 112 5 -107
exit_fs 161 2 -159
sys_unshare 774 590 -184
copy_process 4031 3840 -191
do_exit 1791 1597 -194
copy_fs_struct 202 5 -197

No difference in lmbench lat_proc tests on 2-way Opteron 246.
Smaaaal degradation on UP P4 (within errors).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Alexey Dobriyan
Cc: Arjan van de Ven
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2007-10-20 02:53:56 +0800
ba25f9dcc Use helpers to obtain task pid in printks ... Browse Code »

The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.

The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.

[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov
Cc: Dave Airlie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:43 +0800
9a2e70572 Isolate the explicit usage of signal->pgrp ... Browse Code »

The pgrp field is not used widely around the kernel so it is now marked as
deprecated with appropriate comment.

The initialization of INIT_SIGNALS is trimmed because
a) they are set to 0 automatically;
b) gcc cannot properly initialize two anonymous (the second one
is the one with the session) unions. In this particular case
to make it compile we'd have to add some field initialized
right before the .pgrp.

This is the same patch as the 1ec320afdc9552c92191d5f89fcd1ebe588334ca one
(from Cedric), but for the pgrp field.

Some progress report:

We have to deprecate the pid, tgid, session and pgrp fields on struct
task_struct and struct signal_struct. The session and pgrp are already
deprecated. The tgid value is close to being such - the worst known usage
in in fs/locks.c and audit code. The pid field deprecation is mainly
blocked by numerous printk-s around the kernel that print the tsk->pid to
log.

Signed-off-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Cedric Le Goater
Cc: Serge Hallyn
Cc: "Eric W. Biederman"
Cc: Herbert Poetzl
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:43 +0800
b488893a3 pid namespaces: changes to show virtual ids to user ... Browse Code »

This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.

The idea is:
- all in-kernel data structures must store either struct pid itself
or the pid's global nr, obtained with pid_nr() call;
- when seeking the task from kernel code with the stored id one
should use find_task_by_pid() call that works with global pids;
- when showing pid's numerical value to the user the virtual one
should be used, but however when one shows task's pid outside this
task's namespace the global one is to be used;
- when getting the pid from userspace one need to consider this as
the virtual one and use appropriate task/pid-searching functions.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov
Signed-off-by: Alexey Dobriyan
Cc: Sukadev Bhattiprolu
Cc: Oleg Nesterov
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:40 +0800
3eb07c8c8 pid namespaces: destroy pid namespace on init's death ... Browse Code »

Terminate all processes in a namespace when the reaper of the namespace is
exiting. We do this by walking the pidmap of the namespace and sending
SIGKILL to all processes.

Signed-off-by: Sukadev Bhattiprolu
Acked-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sukadev Bhattiprolu
2007-10-20 02:53:40 +0800
60347f671 pid namespaces: prepare proc_flust_task() to flush entries from multiple proc trees ... Browse Code »

The first part is trivial - we just make the proc_flush_task() to operate on
arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to
it.

The other change is more tricky: I moved the proc_flush_task() call in
release_task() higher to address the following problem.

When flushing task from many proc trees we need to know the set of ids (not
just one pid) to find the dentries' names to flush. Thus we need to pass the
task's pid to proc_flush_task() as struct pid is the only object that can
provide all the pid numbers. But after __exit_signal() task has detached all
his pids and this information is lost.

This creates a tiny gap for proc_pid_lookup() to bring some dentries back to
tree and keep them in hash (since pids are still alive before __exit_signal())
till the next shrink, but since proc_flush_task() does not provide a 100%
guarantee that the dentries will be flushed, this is OK to do so.

Signed-off-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:38 +0800
2e4a70726 pid namespaces: move exit_task_namespaces() ... Browse Code »

Make task release its namespaces after it has reparented all his children to
child_reaper, but before it notifies its parent about its death.

The reason to release namespaces after reparenting is that when task exits it
may send a signal to its parent (SIGCHLD), but if the parent has already
exited its namespaces there will be no way to decide what pid to dever to him
- parent can be from different namespace.

The reason to release namespace before notifying the parent it that when task
sends a SIGCHLD to parent it can call wait() on this taks and release it. But
releasing the mnt namespace implies dropping of all the mounts in the mnt
namespace and NFS expects the task to have valid sighand pointer.

Thanks to Oleg for pointing out some races that can apear and helping with
patches and fixes.

Signed-off-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:38 +0800
762a24bee pid namespaces: rework forget_original_parent() ... Browse Code »

A pid namespace is a "view" of a particular set of tasks on the system. They
work in a similar way to filesystem namespaces. A file (or a process) can be
accessed in multiple namespaces, but it may have a different name in each. In
a filesystem, this name might be /etc/passwd in one namespace, but
/chroot/etc/passwd in another.

For processes, a process may have pid 1234 in one namespace, but be pid 1 in
another. This allows new pid namespaces to have basically arbitrary pids, and
not have to worry about what pids exist in other namespaces. This is
essential for checkpoint/restart where a restarted process's pid might collide
with an existing process on the system's pid.

In this particular implementation, pid namespaces have a parent-child
relationship, just like processes. A process in a pid namespace may see all
of the processes in the same namespace, as well as all of the processes in all
of the namespaces which are children of its namespace. Processes may not,
however, see others which are in their parent's namespace, but not in their
own. The same goes for sibling namespaces.

The know issue to be solved in the nearest future is signal handling in the
namespace boundary. That is, currently the namespace's init is treated like
an ordinary task that can be killed from within an namespace. Ideally, the
signal handling by the namespace's init should have two sides: when signaling
the init from its namespace, the init should look like a real init task, i.e.
receive only those signals, that is explicitly wants to; when signaling the
init from one of the parent namespaces, init should look like an ordinary
task, i.e. receive any signal, only taking the general permissions into
account.

The pid namespace was developed by Pavel Emlyanov and Sukadev Bhattiprolu and
we eventually came to almost the same implementation, which differed in some
details. This set is based on Pavel's patches, but it includes comments and
patches that from Sukadev.

Many thanks to Oleg, who reviewed the patches, pointed out many BUGs and made
valuable advises on how to make this set cleaner.

This patch:

We have to call exit_task_namespaces() only after the exiting task has
reparented all his children and is sure that no other threads will reparent
theirs for it. Why this is needed is explained in appropriate patch. This
one only reworks the forget_original_parent() so that after calling this a
task cannot be/become parent of any other task.

We check PF_EXITING instead of ->exit_state while choosing the new parent.
Note that tasklits_lock acts as a barrier, everyone who takes tasklist after
us (when forget_original_parent() drops it) must see PF_EXITING.

The other changes are just cleanups. They just move some code from
exit_notify to forget_original_parent(). It is a bit silly to declare
ptrace_dead in exit_notify(), take tasklist, pass ptrace_dead to
forget_original_parent(), unlock-lock-unlock tasklist, and then use
ptrace_dead.

Signed-off-by: Oleg Nesterov
Signed-off-by: Pavel Emelyanov
Cc: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2007-10-20 02:53:38 +0800
d4c5e41f3 whitespace fixes: task exit handling ... Browse Code »

Signed-off-by: Daniel Walker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daniel Walker
2007-10-20 02:53:38 +0800
03ff17979 kernel/exit.c: Use list_for_each_entry(_safe) instead of list_for_each(_safe) ... Browse Code »

kernel/exit.c: Convert list_for_each(_safe) to
list_for_each_entry(_safe) in forget_original_parent(), exit_notify()
and do_wait()

Signed-off-by: Matthias Kaehlcke
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Matthias Kaehlcke
2007-10-20 02:53:38 +0800
cf7b708c8 Make access to task's nsproxy lighter ... Browse Code »

When someone wants to deal with some other taks's namespaces it has to lock
the task and then to get the desired namespace if the one exists. This is
slow on read-only paths and may be impossible in some cases.

E.g. Oleg recently noticed a race between unshare() and the (sent for
review in cgroups) pid namespaces - when the task notifies the parent it
has to know the parent's namespace, but taking the task_lock() is
impossible there - the code is under write locked tasklist lock.

On the other hand switching the namespace on task (daemonize) and releasing
the namespace (after the last task exit) is rather rare operation and we
can sacrifice its speed to solve the issues above.

The access to other task namespaces is proposed to be performed
like this:

rcu_read_lock();
nsproxy = task_nsproxy(tsk);
if (nsproxy != NULL) {
/ *
* work with the namespaces here
* e.g. get the reference on one of them
* /
} / *
* NULL task_nsproxy() means that this task is
* almost dead (zombie)
* /
rcu_read_unlock();

This patch has passed the review by Eric and Oleg :) and,
of course, tested.

[clg@fr.ibm.com: fix unshare()]
[ebiederm@xmission.com: Update get_net_ns_by_pid]
Signed-off-by: Pavel Emelyanov
Signed-off-by: Eric W. Biederman
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Serge Hallyn
Signed-off-by: Cedric Le Goater
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:37 +0800
b460cbc58 pid namespaces: define is_global_init() and is_container_init() ... Browse Code »

is_init() is an ambiguous name for the pid==1 check. Split it into
is_global_init() and is_container_init().

A cgroup init has it's tsk->pid == 1.

A global init also has it's tsk->pid == 1 and it's active pid namespace
is the init_pid_ns. But rather than check the active pid namespace,
compare the task structure with 'init_pid_ns.child_reaper', which is
initialized during boot to the /sbin/init process and never changes.

Changelog:

2.6.22-rc4-mm2-pidns1:
- Use 'init_pid_ns.child_reaper' to determine if a given task is the
global init (/sbin/init) process. This would improve performance
and remove dependence on the task_pid().

2.6.21-mm2-pidns2:

- [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
ppc,avr32}/traps.c for the _exception() call to is_global_init().
This way, we kill only the cgroup if the cgroup's init has a
bug rather than force a kernel panic.

[akpm@linux-foundation.org: fix comment]
[sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
[bunk@stusta.de: kernel/pid.c: remove unused exports]
[sukadev@us.ibm.com: Fix capability.c to work with threaded init]
Signed-off-by: Serge E. Hallyn
Signed-off-by: Sukadev Bhattiprolu
Acked-by: Pavel Emelianov
Cc: Eric W. Biederman
Cc: Cedric Le Goater
Cc: Dave Hansen
Cc: Herbert Poetzel
Cc: Kirill Korotaev
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Serge E. Hallyn
2007-10-20 02:53:37 +0800
88f21d818 pid namespaces: rename child_reaper() function ... Browse Code »

Rename the child_reaper() function to task_child_reaper() to be similar to
other task_* functions and to distinguish the function from 'struct
pid_namspace.child_reaper'.

Signed-off-by: Sukadev Bhattiprolu
Cc: Pavel Emelianov
Cc: Eric W. Biederman
Cc: Cedric Le Goater
Cc: Dave Hansen
Cc: Serge Hallyn
Cc: Herbert Poetzel
Cc: Kirill Korotaev
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sukadev Bhattiprolu
2007-10-20 02:53:37 +0800
a47afb0f9 pid namespaces: round up the API ... Browse Code »

The set of functions process_session, task_session, process_group and
task_pgrp is confusing, as the names can be mixed with each other when looking
at the code for a long time.

The proposals are to
* equip the functions that return the integer with _nr suffix to
represent that fact,
* and to make all functions work with task (not process) by making
the common prefix of the same name.

For monotony the routines signal_session() and set_signal_session() are
replaced with task_session_nr() and set_task_session(), especially since they
are only used with the explicit task->signal dereference.

Signed-off-by: Pavel Emelianov
Acked-by: Serge E. Hallyn
Cc: Kirill Korotaev
Cc: "Eric W. Biederman"
Cc: Cedric Le Goater
Cc: Herbert Poetzl
Cc: Sukadev Bhattiprolu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelianov
2007-10-20 02:53:37 +0800
8793d854e Task Control Groups: make cpusets a client of cgroups ... Browse Code »

Remove the filesystem support logic from the cpusets system and makes cpusets
a cgroup subsystem

The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
passed through to the cgroup filesystem with the appropriate options to
emulate the old cpuset filesystem behaviour.

Signed-off-by: Paul Menage
Cc: Serge E. Hallyn
Cc: "Eric W. Biederman"
Cc: Dave Hansen
Cc: Balbir Singh
Cc: Paul Jackson
Cc: Kirill Korotaev
Cc: Herbert Poetzl
Cc: Srivatsa Vaddagiri
Cc: Cedric Le Goater
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Menage
2007-10-20 02:53:36 +0800
b4f48b636 Task Control Groups: add fork()/exit() hooks ... Browse Code »

This adds the necessary hooks to the fork() and exit() paths to ensure
that new children inherit their parent's cgroup assignments, and that
exiting processes release reference counts on their cgroups.

Signed-off-by: Paul Menage
Cc: Serge E. Hallyn
Cc: "Eric W. Biederman"
Cc: Dave Hansen
Cc: Balbir Singh
Cc: Paul Jackson
Cc: Kirill Korotaev
Cc: Herbert Poetzl
Cc: Srivatsa Vaddagiri
Cc: Cedric Le Goater
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Menage
2007-10-20 02:53:36 +0800

17 Oct, 2007

10 commits

42b2dd0a0 Shrink task_struct if CONFIG_FUTEX=n ... Browse Code »

robust_list, compat_robust_list, pi_state_list, pi_state_cache are
really used if futexes are on.

Signed-off-by: Alexey Dobriyan
Acked-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2007-10-17 23:42:55 +0800
6db840fa7 exec: RT sub-thread can livelock and monopolize CPU on exec ... Browse Code »

de_thread() yields waiting for ->group_leader to be a zombie. This deadlocks
if an rt-prio execer shares the same cpu with ->group_leader. Change the code
to use ->group_exit_task/notify_count mechanics.

This patch certainly uglifies the code, perhaps someone can suggest something
better.

Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2007-10-17 23:42:54 +0800
715015e8d wait_task_stopped/continued: remove unneeded p->signal != NULL check ... Browse Code »

The child was found on ->children list under tasklist_lock, it must have a
valid ->signal. __exit_signal() both removes the task from parent->children
and clears ->signal "atomically" under write_lock(tasklist).

Remove unneeded checks.

Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2007-10-17 23:42:52 +0800
442a10cf9 wait_task_zombie: don't fight with non-existing race with a dying ptracee ... Browse Code »

The "p->exit_signal == -1 && p->ptrace == 0" check and the comment are
bogus. We already did exactly the same check in eligible_child(), we did
not drop tasklist_lock since then, and both variables need
write_lock(tasklist) to be changed.

Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2007-10-17 23:42:51 +0800
3ae4cbadf exit_notify: don't take tasklist for TIF_SIGPENDING re-targeting ... Browse Code »

->siglock provides enough protection to iterate over the thread group.

Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2007-10-17 23:42:51 +0800
2f4e6e2a8 wait_task_zombie: fix 2/3 races vs forget_original_parent() ... Browse Code »

Two threads, T1 and T2. T2 ptraces P, and P is not a child of ptracer's
thread group. P exits and goes to TASK_ZOMBIE.

T1 does wait_task_zombie(P):

P->exit_state = TASK_DEAD;
...
read_unlock(&tasklist_lock);

T2 does exit(), takes tasklist,
forget_original_parent() does
__ptrace_unlink(P) but doesn't
call do_notify_parent(P) because
p->exit_state == EXIT_DEAD.

Now, P is not visible to our process: __ptrace_unlink() removed it from
->children. We should send notification to P->parent and release P if and
only if SIGCHLD is ignored.

And we have 3 bugs:

1. P->parent does do_wait() and gets -ECHILD (P is on ->parent->children,
but its state is TASK_DEAD).

2. // wait_task_zombie() continues

if (put_user(...)) {
// TODO: is this safe?
p->exit_state = EXIT_ZOMBIE;
return;
}

we return without notification/release, task_struct leaked.

Solution: ignore -EFAULT and proceed. It is an application's bug if
we can't fill infop/stat_addr (in case of VM_FAULT_OOM we have much
more problems).

3. // wait_task_zombie() continues

if (p->real_parent != p->parent) {
// Not taken, it was untraced'ed
...
}

release_task(p);

we released the task which we shouldn't.

Solution: check ->real_parent != ->parent before, under tasklist_lock,
but use ptrace_unlink() instead of __ptrace_unlink() to check ->ptrace.

This patch hopefully solves 2 and 3, the 1st bug will be fixed later, we need
some cleanups in forget_original_parent/reparent_thread.

However, the first race is very unlikely and not critical, so I hope it makes
sense to fix 1 and 2 for now.

4. Small cleanup: don't "restore" EXIT_ZOMBIE unless we know we are not going
to realease the child.

Signed-off-by: Oleg Nesterov
Cc: Ingo Molnar
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2007-10-17 23:42:51 +0800
407af46a9 wait_task_zombie: remove unneeded child->signal check ... Browse Code »

A zombie must have a valid ->signal, we are going to release it and
__exit_signal() starts with BUG_ON(!sig).

Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2007-10-17 23:42:51 +0800
84eb646b6 handle the multi-threaded init's exit() properly ... Browse Code »

With or without this patch, multi-threaded init's are not fully supported,
but do_exit() is completely wrong. This becomes a real problem when we
support pid namespaces.

1. do_exit() panics when the main thread of /sbin/init exits. It should not
until the whole thread group exits. Move the code below, under the
"if (group_dead)" check.

Note: this means that forget_original_parent() can use an already dead
child_reaper()'s task_struct. This is OK for /sbin/init because

- do_wait() from alive sub-thread still can reap a zombie, we iterate
over all sub-thread's ->children lists

- do_notify_parent() will wakeup some alive sub-thread because it sends
the group-wide signal

However, we should remove choose_new_parent()->BUG_ON(reaper->exit_state)
for this.

2. We are playing games with ->nsproxy->pid_ns. This code is bogus today, and
it has to be changed anyway when we really support pid namespaces, just
remove it.

Signed-off-by: Oleg Nesterov
Roland McGrath
Cc: "Eric W. Biederman"
Cc: Sukadev Bhattiprolu
Cc: Serge Hallyn
Cc: Cedric Le Goater
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2007-10-17 23:42:51 +0800
d2ee7198c pi-futex: set PF_EXITING without taking ->pi_lock ... Browse Code »

It is a bit annoying that do_exit() takes ->pi_lock to set PF_EXITING. All
we need is to synchronize with lookup_pi_state() which saw this task
without PF_EXITING under ->pi_lock.

Change do_exit() to use spin_unlock_wait().

Signed-off-by: Oleg Nesterov
Acked-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2007-10-17 23:42:51 +0800
a9022e9cb Clean up duplicate includes in kernel/ ... Browse Code »

This patch cleans up duplicate includes in
kernel/

Signed-off-by: Jesper Juhl
Acked-by: Paul E. McKenney
Reviewed-by: Satyam Sharma
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jesper Juhl
2007-10-17 23:42:48 +0800

15 Oct, 2007

1 commit

9ac52315d sched: guest CPU accounting: add guest-CPU /proc/<pid>/stat fields ... Browse Code »

like for cpustat, introduce the "gtime" (guest time of the task) and
"cgtime" (guest time of the task children) fields for the
tasks. Modify signal_struct and task_struct.

Modify /proc//stat to display these new fields.

Signed-off-by: Laurent Vivier
Acked-by: Avi Kivity
Signed-off-by: Ingo Molnar

Laurent Vivier
2007-10-15 23:00:19 +0800

21 Sep, 2007

1 commit

b8fceee17 signalfd simplification ... Browse Code »

This simplifies signalfd code, by avoiding it to remain attached to the
sighand during its lifetime.

In this way, the signalfd remain attached to the sighand only during
poll(2) (and select and epoll) and read(2). This also allows to remove
all the custom "tsk == current" checks in kernel/signal.c, since
dequeue_signal() will only be called by "current".

I think this is also what Ben was suggesting time ago.

The external effect of this, is that a thread can extract only its own
private signals and the group ones. I think this is an acceptable
behaviour, in that those are the signals the thread would be able to
fetch w/out signalfd.

Signed-off-by: Davide Libenzi
Signed-off-by: Linus Torvalds

Davide Libenzi
2007-09-21 04:19:59 +0800

31 Aug, 2007

1 commit

f2ab6d888 Assign task_struct.exit_code before taskstats_exit() ... Browse Code »

taskstats.ac_exitcode is assigned to task_struct.exit_code in bacct_add_tsk()
through the following kernel function calls:

do_exit()
taskstats_exit()
fill_pid()
bacct_add_tsk()

The problem is that in do_exit(), task_struct.exit_code is set to 'code' only
after taskstats_exit() has been called. So we need to move the assignment
before taskstats_exit().

Signed-off-by: Jonathan Lim
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jonathan Lim
2007-08-31 16:42:22 +0800

04 Aug, 2007

1 commit

247284481 Kill some obsolete sub-thread-ptrace stuff ... Browse Code »

There is a couple of subtle checks which were needed to handle ptracing from
the same thread group. This was deprecated a long ago, imho this code just
complicates the understanding.

And, the "->parent->signal->flags & SIGNAL_GROUP_EXIT" check in exit_notify()
is not right. SIGNAL_GROUP_EXIT can mean exec(), not exit_group(). This means
ptracer can lose a ptraced zombie on exec(). Minor problem, but still the bug.

Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
Signed-off-by: Linus Torvalds

Oleg Nesterov
2007-08-04 06:06:33 +0800

20 Jul, 2007

1 commit

0c1eecfb3 Freezer: avoid freezing kernel threads prematurely ... Browse Code »

Kernel threads should not have TIF_FREEZE set when user space processes are
being frozen, since otherwise some of them might be frozen prematurely.
To prevent this from happening we can (1) make exit_mm() unset TIF_FREEZE
unconditionally just after clearing tsk->mm and (2) make try_to_freeze_tasks()
check if p->mm is different from zero and PF_BORROWED_MM is unset in p->flags
when user space processes are to be frozen.

Namely, when user space processes are being frozen, we only should set
TIF_FREEZE for tasks that have p->mm different from NULL and don't have
PF_BORROWED_MM set in p->flags. For this reason task_lock() must be used to
prevent try_to_freeze_tasks() from racing with use_mm()/unuse_mm(), in which
p->mm and p->flags.PF_BORROWED_MM are changed under task_lock(p). Also, we
need to prevent the following scenario from happening:

* daemonize() is called by a task spawned from a user space code path
* freezer checks if the task has p->mm set and the result is positive
* task enters exit_mm() and clears its TIF_FREEZE
* freezer sets TIF_FREEZE for the task
* task calls try_to_freeze() and goes to the refrigerator, which is wrong at
that point

This requires us to acquire task_lock(p) before p->flags.PF_BORROWED_MM and
p->mm are examined and release it after TIF_FREEZE is set for p (or it turns
out that TIF_FREEZE should not be set).

Signed-off-by: Rafael J. Wysocki
Cc: Gautham R Shenoy
Cc: Pavel Machek
Cc: Nigel Cunningham
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rafael J. Wysocki
2007-07-20 01:04:42 +0800

18 Jul, 2007

1 commit

831441862 Freezer: make kernel threads nonfreezable by default ... Browse Code »

Currently, the freezer treats all tasks as freezable, except for the kernel
threads that explicitly set the PF_NOFREEZE flag for themselves. This
approach is problematic, since it requires every kernel thread to either
set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't
care for the freezing of tasks at all.

It seems better to only require the kernel threads that want to or need to
be frozen to use some freezer-related code and to remove any
freezer-related code from the other (nonfreezable) kernel threads, which is
done in this patch.

The patch causes all kernel threads to be nonfreezable by default (ie. to
have PF_NOFREEZE set by default) and introduces the set_freezable()
function that should be called by the freezable kernel threads in order to
unset PF_NOFREEZE. It also makes all of the currently freezable kernel
threads call set_freezable(), so it shouldn't cause any (intentional)
change of behaviour to appear. Additionally, it updates documentation to
describe the freezing of tasks more accurately.

[akpm@linux-foundation.org: build fixes]
Signed-off-by: Rafael J. Wysocki
Acked-by: Nigel Cunningham
Cc: Pavel Machek
Cc: Oleg Nesterov
Cc: Gautham R Shenoy
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rafael J. Wysocki
2007-07-18 01:23:02 +0800

17 Jul, 2007

2 commits

522ed7767 Audit: add TTY input auditing ... Browse Code »

Add TTY input auditing, used to audit system administrator's actions. This is
required by various security standards such as DCID 6/3 and PCI to provide
non-repudiation of administrator's actions and to allow a review of past
actions if the administrator seems to overstep their duties or if the system
becomes misconfigured for unknown reasons. These requirements do not make it
necessary to audit TTY output as well.

Compared to an user-space keylogger, this approach records TTY input using the
audit subsystem, correlated with other audit events, and it is completely
transparent to the user-space application (e.g. the console ioctls still
work).

TTY input auditing works on a higher level than auditing all system calls
within the session, which would produce an overwhelming amount of mostly
useless audit events.

Add an "audit_tty" attribute, inherited across fork (). Data read from TTYs
by process with the attribute is sent to the audit subsystem by the kernel.
The audit netlink interface is extended to allow modifying the audit_tty
attribute, and to allow sending explanatory audit events from user-space (for
example, a shell might send an event containing the final command, after the
interactive command-line editing and history expansion is performed, which
might be difficult to decipher from the TTY input alone).

Because the "audit_tty" attribute is inherited across fork (), it would be set
e.g. for sshd restarted within an audited session. To prevent this, the
audit_tty attribute is cleared when a process with no open TTY file
descriptors (e.g. after daemon startup) opens a TTY.

See https://www.redhat.com/archives/linux-audit/2007-June/msg00000.html for a
more detailed rationale document for an older version of this patch.

[akpm@linux-foundation.org: build fix]
Signed-off-by: Miloslav Trmac
Cc: Al Viro
Cc: Alan Cox
Cc: Paul Fulghum
Cc: Casey Schaufler
Cc: Steve Grubb
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miloslav Trmac
2007-07-17 00:05:47 +0800
e18eecb8b Add generic exit-time stack-depth checking to CONFIG_DEBUG_STACK_USAGE ... Browse Code »

Add generic exit-time stack-depth checking to CONFIG_DEBUG_STACK_USAGE.

This also adds UML support.

Tested on UML and i386.

[akpm@linux-foundation.org: cleanups, speedups, tweaks]
Signed-off-by: Jeff Dike
Cc: Oleg Nesterov
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jeff Dike
2007-07-17 00:05:38 +0800

10 Jul, 2007

3 commits

172ba844a sched: update delay-accounting to use CFS's precise stats ... Browse Code »

update delay-accounting to use CFS's precise stats.

Signed-off-by: Ingo Molnar

Balbir Singh
2007-07-10 00:52:00 +0800
e05606d33 sched: clean up the rt priority macros ... Browse Code »

clean up the rt priority macros, pointed out by Andrew Morton.

Signed-off-by: Ingo Molnar

Ingo Molnar
2007-07-10 00:51:59 +0800
f64f61145 sched: remove sched_exit() ... Browse Code »

remove sched_exit(): the elaborate dance of us trying to recover
timeslices given to child tasks never really worked.

CFS does not need it either.

Signed-off-by: Ingo Molnar

Ingo Molnar
2007-07-10 00:51:58 +0800