Eric Lee / smarc-fsl-linux-kernel

20 Oct, 2007

40 commits

8256e47cd Linux Kernel Markers ... Browse Code »

The marker activation functions sits in kernel/marker.c. A hash table is used
to keep track of the registered probes and armed markers, so the markers
within a newly loaded module that should be active can be activated at module
load time.

marker_query has been removed. marker_get_first, marker_get_next and
marker_release should be used as iterators on the markers.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mathieu Desnoyers
Acked-by: "Frank Ch. Eigler"
Cc: Christoph Hellwig
Cc: Rusty Russell
Cc: Mike Mason
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mathieu Desnoyers
2007-10-20 02:53:54 +0800
68318b8e0 Hook up group scheduler with control groups ... Browse Code »

Enable "cgroup" (formerly containers) based fair group scheduling. This
will let administrator create arbitrary groups of tasks (using "cgroup"
pseudo filesystem) and control their cpu bandwidth usage.

[akpm@linux-foundation.org: fix cpp condition]
Signed-off-by: Srivatsa Vaddagiri
Signed-off-by: Dhaval Giani
Cc: Randy Dunlap
Cc: Balbir Singh
Cc: Paul Menage
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Srivatsa Vaddagiri
2007-10-20 02:53:51 +0800
cb3808532 Use extended crashkernel command line on ia64 ... Browse Code »

This patch adapts IA64 to use the generic parse_crashkernel() function instead
of its own parsing for the crashkernel command line.

Because the total amount of System RAM must be known when calling this
function, efi_memmap_init() is modified to return its accumulated total_memory
variable.

Also, the crashkernel handling is moved in an own function in
arch/ia64/kernel/setup.c to make the code more readable.

[kamalesh@linux.vnet.ibm.com: build fix]
Signed-off-by: Bernhard Walle
Cc: "Luck, Tony"
Cc: Vivek Goyal
Cc: "Eric W. Biederman"
Signed-off-by: Kamalesh Babulal
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Bernhard Walle
2007-10-20 02:53:50 +0800
cba63c308 Extended crashkernel command line ... Browse Code »

This patch adds a extended crashkernel syntax that makes the value of reserved
system RAM dependent on the system RAM itself:

crashkernel=:[,:,...][@offset]
range=start-[end]

For example:

crashkernel=512M-2G:64M,2G-:128M

The motivation comes from distributors that configure their crashkernel
command line automatically with some configuration tool (YaST, you know ;)).
Of course that tool knows the value of System RAM, but if the user removes
RAM, then the system becomes unbootable or at least unusable and error
handling is very difficult.

This series implements this change for i386, x86_64, ia64, ppc64 and sh. That
should be all platforms that support kdump in current mainline. I tested all
platforms except sh due to the lack of a sh processor.

This patch:

This is the generic part of the patch. It adds a parse_crashkernel() function
in kernel/kexec.c that is called by the architecture specific code that
actually reserves the memory. That function takes the whole command line and
looks itself for "crashkernel=" in it.

If there are multiple occurrences, then the last one is taken. The advantage
is that if you have a bootloader like lilo or elilo which allows you to append
a command line parameter but not to remove one (like in GRUB), then you can
add another crashkernel value for testing at the boot command line and this
one overwrites the command line in the configuration then.

Signed-off-by: Bernhard Walle
Cc: Andi Kleen
Cc: "Luck, Tony"
Cc: Paul Mackerras
Cc: Benjamin Herrenschmidt
Cc: Paul Mundt
Cc: Vivek Goyal
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Bernhard Walle
2007-10-20 02:53:49 +0800
3ac88a41f virtualization of sysv msg queues is incomplete ... Browse Code »

Virtualization of sysv msg queues is incomplete: msg_hdrs and msg_bytes
variables visible from userspace are global. Let's make them
per-namespace.

Signed-off-by: Alexey Kuznetsov
Signed-off-by: Kirill Korotaev
Cc: Pierre Peiffer
Cc: Nadia Derbey
Cc: Serge Hallyn
Acked-by: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill Korotaev
2007-10-20 02:53:48 +0800
7ca7e564e ipc: store ipcs into IDRs ... Browse Code »

This patch introduces ipcs storage into IDRs. The main changes are:
. This ipc_ids structure is changed: the entries array is changed into a
root idr structure.
. The grow_ary() routine is removed: it is not needed anymore when adding
an ipc structure, since we are now using the IDR facility.
. The ipc_rmid() routine interface is changed:
. there is no need for this routine to return the pointer passed in as
argument: it is now declared as a void
. since the id is now part of the kern_ipc_perm structure, no need to
have it as an argument to the routine

Signed-off-by: Nadia Derbey
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nadia Derbey
2007-10-20 02:53:44 +0800
470fd6464 hotplug cpu: migrate a task within its cpuset ... Browse Code »

When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have
been running on that cpu.

Currently, such a task is migrated:
1) to any cpu on the same node as the disabled cpu, which is both online
and among that task's cpus_allowed
2) to any cpu which is both online and among that task's cpus_allowed

It is typical of a multithreaded application running on a large NUMA system to
have its tasks confined to a cpuset so as to cluster them near the memory that
they share. Furthermore, it is typical to explicitly place such a task on a
specific cpu in that cpuset. And in that case the task's cpus_allowed
includes only a single cpu.

This patch would insert a preference to migrate such a task to some cpu within
its cpuset (and set its cpus_allowed to its entire cpuset).

With this patch, migrate the task to:
1) to any cpu on the same node as the disabled cpu, which is both online
and among that task's cpus_allowed
2) to any online cpu within the task's cpuset
3) to any cpu which is both online and among that task's cpus_allowed

In order to do this, move_task_off_dead_cpu() must make a call to
cpuset_cpus_allowed_locked(), a new subset of cpuset_cpus_allowed(), that will
not block. (name change - per Oleg's suggestion)

Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to set
the cpuset mutex during the whole migrate_live_tasks() and
migrate_dead_tasks() procedure.

[akpm@linux-foundation.org: build fix]
[pj@sgi.com: Fix indentation and spacing]
Signed-off-by: Cliff Wickman
Cc: Oleg Nesterov
Cc: Christoph Lameter
Cc: Paul Jackson
Cc: Ingo Molnar
Signed-off-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cliff Wickman
2007-10-20 02:53:44 +0800
ba25f9dcc Use helpers to obtain task pid in printks ... Browse Code »

The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.

The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.

[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov
Cc: Dave Airlie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:43 +0800
9a2e70572 Isolate the explicit usage of signal->pgrp ... Browse Code »

The pgrp field is not used widely around the kernel so it is now marked as
deprecated with appropriate comment.

The initialization of INIT_SIGNALS is trimmed because
a) they are set to 0 automatically;
b) gcc cannot properly initialize two anonymous (the second one
is the one with the session) unions. In this particular case
to make it compile we'd have to add some field initialized
right before the .pgrp.

This is the same patch as the 1ec320afdc9552c92191d5f89fcd1ebe588334ca one
(from Cedric), but for the pgrp field.

Some progress report:

We have to deprecate the pid, tgid, session and pgrp fields on struct
task_struct and struct signal_struct. The session and pgrp are already
deprecated. The tgid value is close to being such - the worst known usage
in in fs/locks.c and audit code. The pid field deprecation is mainly
blocked by numerous printk-s around the kernel that print the tsk->pid to
log.

Signed-off-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Cedric Le Goater
Cc: Serge Hallyn
Cc: "Eric W. Biederman"
Cc: Herbert Poetzl
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:43 +0800
14ed9d23a remove BITS_TO_TYPE macro ... Browse Code »

remove BITS_TO_TYPE macro

I realized, that it is actually the same as DIV_ROUND_UP, use it instead.

[akpm@linux-foundation.org: build fix]
Signed-off-by: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Slaby
2007-10-20 02:53:42 +0800
93043ece0 define global BIT macro ... Browse Code »

define global BIT macro

move all local BIT defines to the new globally define macro.

Signed-off-by: Jiri Slaby
Cc: Paul Mackerras
Cc: Benjamin Herrenschmidt
Cc: Kumar Gala
Cc: Dmitry Torokhov
Cc: Jeff Garzik
Cc: James Bottomley
Cc: "Antonino A. Daplas"
Cc: Russell King
Acked-by: Ralf Baechle
Cc: "John W. Linville"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Slaby
2007-10-20 02:53:42 +0800
7b19ada2e get rid of input BIT* duplicate defines ... Browse Code »

get rid of input BIT* duplicate defines

use newly global defined macros for input layer. Also remove includes of
input.h from non-input sources only for BIT macro definiton. Define the
macro temporarily in local manner, all those local definitons will be
removed further in this patchset (to not break bisecting).
BIT macro will be globally defined (1<
Cc:
Acked-by: Jiri Kosina
Cc:
Acked-by: Marcel Holtmann
Cc:
Acked-by: Mauro Carvalho Chehab
Cc:
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Slaby
2007-10-20 02:53:42 +0800
d05be13bc define first set of BIT* macros ... Browse Code »

define first set of BIT* macros

- move BITOP_MASK and BITOP_WORD from asm-generic/bitops/atomic.h to
include/linux/bitops.h and rename it to BIT_MASK and BIT_WORD
- move BITS_TO_LONGS and BITS_PER_BYTE to bitops.h too and allow easily
define another BITS_TO_something (e.g. in event.c) by BITS_TO_TYPE macro
Remaining (and common) BIT macro will be defined after all occurences and
conflicts will be sorted out in the patches.

Signed-off-by: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Slaby
2007-10-20 02:53:42 +0800
0624517d8 forbid asm/bitops.h direct inclusion ... Browse Code »

forbid asm/bitops.h direct inclusion

Because of compile errors that may occur after bit changes if asm/bitops.h is
included directly without e.g. linux/kernel.h which includes linux/bitops.h,
forbid direct inclusion of asm/bitops.h. Thanks to Adrian Bunk.

Signed-off-by: Jiri Slaby
Cc: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Slaby
2007-10-20 02:53:41 +0800
1977f0327 remove asm/bitops.h includes ... Browse Code »

remove asm/bitops.h includes

including asm/bitops directly may cause compile errors. don't include it
and include linux/bitops instead. next patch will deny including asm header
directly.

Cc: Adrian Bunk
Signed-off-by: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Slaby
2007-10-20 02:53:41 +0800
bc552f771 Misc: phantom, improved data passing ... Browse Code »

This new version guarantees amb_bit switch in small enough intervals, so that
the device won't stop working in the middle of a movement anymore. However it
preserves old (openhaptics) functionality.

Signed-off-by: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Slaby
2007-10-20 02:53:41 +0800
8707d8b8c Fix cpusets update_cpumask ... Browse Code »

Cause writes to cpuset "cpus" file to update cpus_allowed for member tasks:

- collect batches of tasks under tasklist_lock and then call
set_cpus_allowed() on them outside the lock (since this can sleep).

- add a simple generic priority heap type to allow efficient collection
of batches of tasks to be processed without duplicating or missing any
tasks in subsequent batches.

- make "cpus" file update a no-op if the mask hasn't changed

- fix race between update_cpumask() and sched_setaffinity() by making
sched_setaffinity() post-check that it's not running on any cpus outside
cpuset_cpus_allowed().

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Paul Menage
Cc: Paul Jackson
Cc: David Rientjes
Cc: Nick Piggin
Cc: Peter Zijlstra
Cc: Balbir Singh
Cc: Cedric Le Goater
Cc: "Eric W. Biederman"
Cc: Serge Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Menage
2007-10-20 02:53:41 +0800
029190c51 cpuset sched_load_balance flag ... Browse Code »

Add a new per-cpuset flag called 'sched_load_balance'.

When enabled in a cpuset (the default value) it tells the kernel scheduler
that the scheduler should provide the normal load balancing on the CPUs in
that cpuset, sometimes moving tasks from one CPU to a second CPU if the
second CPU is less loaded and if that task is allowed to run there.

When disabled (write "0" to the file) then it tells the kernel scheduler
that load balancing is not required for the CPUs in that cpuset.

Now even if this flag is disabled for some cpuset, the kernel may still
have to load balance some or all the CPUs in that cpuset, if some
overlapping cpuset has its sched_load_balance flag enabled.

If there are some CPUs that are not in any cpuset whose sched_load_balance
flag is enabled, the kernel scheduler will not load balance tasks to those
CPUs.

Moreover the kernel will partition the 'sched domains' (non-overlapping
sets of CPUs over which load balancing is attempted) into the finest
granularity partition that it can find, while still keeping any two CPUs
that are in the same shed_load_balance enabled cpuset in the same element
of the partition.

This serves two purposes:
1) It provides a mechanism for real time isolation of some CPUs, and
2) it can be used to improve performance on systems with many CPUs
by supporting configurations in which load balancing is not done
across all CPUs at once, but rather only done in several smaller
disjoint sets of CPUs.

This mechanism replaces the earlier overloading of the per-cpuset
flag 'cpu_exclusive', which overloading was removed in an earlier
patch: cpuset-remove-sched-domain-hooks-from-cpusets

See further the Documentation and comments in the code itself.

[akpm@linux-foundation.org: don't be weird]
Signed-off-by: Paul Jackson
Acked-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Jackson
2007-10-20 02:53:41 +0800
2f2a3a46f Uninline the task_xid_nr_ns() calls ... Browse Code »

Since these are expanded into call to pid_nr_ns() anyway, it's OK to move
the whole routine out-of-line. This is a cheap way to save ~100 bytes from
vmlinux. Together with the previous two patches, it saves half-a-kilo from
the vmlinux.

Un-inline other (currently inlined) functions must be done with additional
performance testing.

Signed-off-by: Pavel Emelyanov
Cc: Sukadev Bhattiprolu
Cc: Oleg Nesterov
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:41 +0800
8990571eb Uninline find_pid etc set of functions ... Browse Code »

The find_pid/_vpid/_pid_ns functions are used to find the struct pid by its
id, depending on whic id - global or virtual - is used.

The find_vpid() is a macro that pushes the current->nsproxy->pid_ns on the
stack to call another function - find_pid_ns(). It turned out, that this
dereference together with the push itself cause the kernel text size to
grow too much.

Move all these out-of-line. Together with the previous patch this saves a
bit less that 400 bytes from .text section.

Signed-off-by: Pavel Emelyanov
Cc: Sukadev Bhattiprolu
Cc: Oleg Nesterov
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:41 +0800
bac0abd61 Isolate some explicit usage of task->tgid ... Browse Code »

With pid namespaces this field is now dangerous to use explicitly, so hide
it behind the helpers.

Also the pid and pgrp fields o task_struct and signal_struct are to be
deprecated. Unfortunately this patch cannot be sent right now as this
leads to tons of warnings, so start isolating them, and deprecate later.

Actually the p->tgid == pid has to be changed to has_group_leader_pid(),
but Oleg pointed out that in case of posix cpu timers this is the same, and
thread_group_leader() is more preferable.

Signed-off-by: Pavel Emelyanov
Acked-by: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:40 +0800
19b9b9b54 pid namespaces: remove the struct pid unneeded fields ... Browse Code »

Since we've switched from using pid->nr to pid->upids->nr some
fields on struct pid are no longer needed

Signed-off-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:40 +0800
228ebcbe6 Uninline find_task_by_xxx set of functions ... Browse Code »

The find_task_by_something is a set of macros are used to find task by pid
depending on what kind of pid is proposed - global or virtual one. All of
them are wrappers above the most generic one - find_task_by_pid_type_ns() -
and just substitute some args for it.

It turned out, that dereferencing the current->nsproxy->pid_ns construction
and pushing one more argument on the stack inline cause kernel text size to
grow.

This patch moves all this stuff out-of-line into kernel/pid.c. Together
with the next patch it saves a bit less than 400 bytes from the .text
section.

Signed-off-by: Pavel Emelyanov
Cc: Sukadev Bhattiprolu
Cc: Oleg Nesterov
Cc: Paul Menage
Cc: "Eric W. Biederman"
Acked-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:40 +0800
b488893a3 pid namespaces: changes to show virtual ids to user ... Browse Code »

This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.

The idea is:
- all in-kernel data structures must store either struct pid itself
or the pid's global nr, obtained with pid_nr() call;
- when seeking the task from kernel code with the stored id one
should use find_task_by_pid() call that works with global pids;
- when showing pid's numerical value to the user the virtual one
should be used, but however when one shows task's pid outside this
task's namespace the global one is to be used;
- when getting the pid from userspace one need to consider this as
the virtual one and use appropriate task/pid-searching functions.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov
Signed-off-by: Alexey Dobriyan
Cc: Sukadev Bhattiprolu
Cc: Oleg Nesterov
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:40 +0800
3eb07c8c8 pid namespaces: destroy pid namespace on init's death ... Browse Code »

Terminate all processes in a namespace when the reaper of the namespace is
exiting. We do this by walking the pidmap of the namespace and sending
SIGKILL to all processes.

Signed-off-by: Sukadev Bhattiprolu
Acked-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sukadev Bhattiprolu
2007-10-20 02:53:40 +0800
6f4e64335 pid namespaces: initialize the namespace's proc_mnt ... Browse Code »

The namespace's proc_mnt must be kern_mount-ed to make this pointer always
valid, independently of whether the user space mounted the proc or not. This
solves raced in proc_flush_task, etc. with the proc_mnt switching from NULL
to not-NULL.

The initialization is done after the init's pid is created and hashed to make
proc_get_sb() finr it and get for root inode.

Sice the namespace holds the vfsmnt, vfsmnt holds the superblock and the
superblock holds the namespace we must explicitly break this circle to destroy
all the stuff. This is done after the init of the namespace dies. Running a
few steps forward - when init exits it will kill all its children, so no
proc_mnt will be needed after its death.

Signed-off-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:40 +0800
30e49c263 pid namespaces: allow cloning of new namespace ... Browse Code »

When clone() is invoked with CLONE_NEWPID, create a new pid namespace and then
create a new struct pid for the new process. Allocate pid_t's for the new
process in the new pid namespace and all ancestor pid namespaces. Make the
newly cloned process the session and process group leader.

Since the active pid namespace is special and expected to be the first entry
in pid->upid_list, preserve the order of pid namespaces.

The size of 'struct pid' is dependent on the the number of pid namespaces the
process exists in, so we use multiple pid-caches'. Only one pid cache is
created during system startup and this used by processes that exist only in
init_pid_ns.

When a process clones its pid namespace, we create additional pid caches as
necessary and use the pid cache to allocate 'struct pids' for that depth.

Note, that with this patch the newly created namespace won't work, since the
rest of the kernel still uses global pids, but this is to be fixed soon. Init
pid namespace still works.

[oleg@tv-sign.ru: merge fix]
Signed-off-by: Pavel Emelyanov
Signed-off-by: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:39 +0800
b461cc038 pid namespaces: miscellaneous preparations for pid namespaces ... Browse Code »

* remove pid.h from pid_namespaces.h;
* rework is_(cgroup|global)_init;
* optimize (get|put)_pid_ns for init_pid_ns;
* declare task_child_reaper to return actual reaper.

Signed-off-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:39 +0800
07543f5c7 pid namespaces: make proc have multiple superblocks - one for each namespace ... Browse Code »

Each pid namespace have to be visible through its own proc mount. Thus we
need to have per-namespace proc trees with their own superblocks.

We cannot easily show different pid namespace via one global proc tree, since
each pid refers to different tasks in different namespaces. E.g. pid 1
refers to the init task in the initial namespace and to some other task when
seeing from another namespace. Moreover - pid, exisintg in one namespace may
not exist in the other.

This approach has one move advantage is that the tasks from the init namespace
can see what tasks live in another namespace by reading entries from another
proc tree.

Signed-off-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:39 +0800
198fe21b0 pid namespaces: helpers to find the task by its numerical ids ... Browse Code »

When searching the task by numerical id on may need to find it using global
pid (as it is done now in kernel) or by its virtual id, e.g. when sending a
signal to a task from one namespace the sender will specify the task's virtual
id and we should find the task by this value.

[akpm@linux-foundation.org: fix gfs2 linkage]
Signed-off-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:39 +0800
7af572947 pid namespaces: helpers to obtain pid numbers ... Browse Code »

When showing pid to user or getting the pid numerical id for in-kernel use the
value of this id may differ depending on the namespace.

This set of helpers is used to get the global pid nr, the virtual (i.e. seen
by task in its namespace) nr and the nr as it is seen from the specified
namespace.

Signed-off-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:39 +0800
8ef047aaa pid namespaces: make alloc_pid(), free_pid() and put_pid() work with struct upid ... Browse Code »

Each struct upid element of struct pid has to be initialized properly, i.e.
its nr mst be allocated from appropriate pidmap and ns set to appropriate
namespace.

When allocating a new pid, we need to know the namespace this pid will live
in, so the additional argument is added to alloc_pid().

On the other hand, the rest of the kernel still uses the pid->nr and
pid->pid_chain fields, so these ones are still initialized, but this will be
removed soon.

Signed-off-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:39 +0800
faacbfd3a pid namespaces: add support for pid namespaces hierarchy ... Browse Code »

Each namespace has a parent and is characterized by its "level". Level is the
number of the namespace generation. E.g. init namespace has level 0, after
cloning new one it will have level 1, the next one - 2 and so on and so forth.
This level is not explicitly limited.

True hierarchy must have some way to find each namespace's children, but it is
not used in the patches, so this ability is not added (yet).

Signed-off-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:38 +0800
4c3f2ead5 pid namespaces: introduce struct upid ... Browse Code »

Since task will be visible from different pid namespaces each of them have to
be addressed by multiple pids. struct upid is to store the information about
which id refers to which namespace.

The constuciton looks like this. Each struct pid carried the reference
counter and the list of tasks attached to this pid. At its end it has a
variable length array of struct upid-s. Each struct upid has a numerical id
(pid itself), pointer to the namespace, this ID is valid in and is hashed into
a pid_hash for searching the pids.

The nr and pid_chain fields are kept in struct pid for a while to make kernel
still work (no patch initialize the upids yet), but it will be removed at the
end of this series when we switch to upids completely.

Signed-off-by: Sukadev Bhattiprolu
Signed-off-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sukadev Bhattiprolu
2007-10-20 02:53:38 +0800
60347f671 pid namespaces: prepare proc_flust_task() to flush entries from multiple proc trees ... Browse Code »

The first part is trivial - we just make the proc_flush_task() to operate on
arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to
it.

The other change is more tricky: I moved the proc_flush_task() call in
release_task() higher to address the following problem.

When flushing task from many proc trees we need to know the set of ids (not
just one pid) to find the dentries' names to flush. Thus we need to pass the
task's pid to proc_flush_task() as struct pid is the only object that can
provide all the pid numbers. But after __exit_signal() task has detached all
his pids and this information is lost.

This creates a tiny gap for proc_pid_lookup() to bring some dentries back to
tree and keep them in hash (since pids are still alive before __exit_signal())
till the next shrink, but since proc_flush_task() does not provide a 100%
guarantee that the dentries will be flushed, this is OK to do so.

Signed-off-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:38 +0800
8bf9725c2 pid namespaces: introduce MS_KERNMOUNT flag ... Browse Code »

This flag tells the .get_sb callback that this is a kern_mount() call so that
it can trust *data pointer to be valid in-kernel one. If this flag is passed
from the user process, it is cleared since the *data pointer is not a valid
kernel object.

Running a few steps forward - this will be needed for proc to create the
superblock and store a valid pid namespace on it during the namespace
creation. The reason, why the namespace cannot live without proc mount is
described in the appropriate patch.

Signed-off-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:38 +0800
4e6045f13 workqueue: debug flushing deadlocks with lockdep ... Browse Code »

In the following scenario:

code path 1:
my_function() -> lock(L1); ...; flush_workqueue(); ...

code path 2:
run_workqueue() -> my_work() -> ...; lock(L1); ...

you can get a deadlock when my_work() is queued or running
but my_function() has acquired L1 already.

This patch adds a pseudo-lock to each workqueue to make lockdep
warn about this scenario.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Johannes Berg
Acked-by: Oleg Nesterov
Acked-by: Ingo Molnar
Acked-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Berg
2007-10-20 02:53:38 +0800
cf7b708c8 Make access to task's nsproxy lighter ... Browse Code »

When someone wants to deal with some other taks's namespaces it has to lock
the task and then to get the desired namespace if the one exists. This is
slow on read-only paths and may be impossible in some cases.

E.g. Oleg recently noticed a race between unshare() and the (sent for
review in cgroups) pid namespaces - when the task notifies the parent it
has to know the parent's namespace, but taking the task_lock() is
impossible there - the code is under write locked tasklist lock.

On the other hand switching the namespace on task (daemonize) and releasing
the namespace (after the last task exit) is rather rare operation and we
can sacrifice its speed to solve the issues above.

The access to other task namespaces is proposed to be performed
like this:

rcu_read_lock();
nsproxy = task_nsproxy(tsk);
if (nsproxy != NULL) {
/ *
* work with the namespaces here
* e.g. get the reference on one of them
* /
} / *
* NULL task_nsproxy() means that this task is
* almost dead (zombie)
* /
rcu_read_unlock();

This patch has passed the review by Eric and Oleg :) and,
of course, tested.

[clg@fr.ibm.com: fix unshare()]
[ebiederm@xmission.com: Update get_net_ns_by_pid]
Signed-off-by: Pavel Emelyanov
Signed-off-by: Eric W. Biederman
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Serge Hallyn
Signed-off-by: Cedric Le Goater
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2007-10-20 02:53:37 +0800
b460cbc58 pid namespaces: define is_global_init() and is_container_init() ... Browse Code »

is_init() is an ambiguous name for the pid==1 check. Split it into
is_global_init() and is_container_init().

A cgroup init has it's tsk->pid == 1.

A global init also has it's tsk->pid == 1 and it's active pid namespace
is the init_pid_ns. But rather than check the active pid namespace,
compare the task structure with 'init_pid_ns.child_reaper', which is
initialized during boot to the /sbin/init process and never changes.

Changelog:

2.6.22-rc4-mm2-pidns1:
- Use 'init_pid_ns.child_reaper' to determine if a given task is the
global init (/sbin/init) process. This would improve performance
and remove dependence on the task_pid().

2.6.21-mm2-pidns2:

- [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
ppc,avr32}/traps.c for the _exception() call to is_global_init().
This way, we kill only the cgroup if the cgroup's init has a
bug rather than force a kernel panic.

[akpm@linux-foundation.org: fix comment]
[sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
[bunk@stusta.de: kernel/pid.c: remove unused exports]
[sukadev@us.ibm.com: Fix capability.c to work with threaded init]
Signed-off-by: Serge E. Hallyn
Signed-off-by: Sukadev Bhattiprolu
Acked-by: Pavel Emelianov
Cc: Eric W. Biederman
Cc: Cedric Le Goater
Cc: Dave Hansen
Cc: Herbert Poetzel
Cc: Kirill Korotaev
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Serge E. Hallyn
2007-10-20 02:53:37 +0800
88f21d818 pid namespaces: rename child_reaper() function ... Browse Code »

Rename the child_reaper() function to task_child_reaper() to be similar to
other task_* functions and to distinguish the function from 'struct
pid_namspace.child_reaper'.

Signed-off-by: Sukadev Bhattiprolu
Cc: Pavel Emelianov
Cc: Eric W. Biederman
Cc: Cedric Le Goater
Cc: Dave Hansen
Cc: Serge Hallyn
Cc: Herbert Poetzel
Cc: Kirill Korotaev
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sukadev Bhattiprolu
2007-10-20 02:53:37 +0800