Doug / smarc-fsl-linux-kernel | Embedian Git Server

25 Sep, 2009

3 commits

8b3f6af86 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/

Conflicts:
drivers/staging/Kconfig
drivers/staging/Makefile
drivers/staging/cpc-usb/TODO
drivers/staging/cpc-usb/cpc-usb_drv.c
drivers/staging/cpc-usb/cpc.h
drivers/staging/cpc-usb/cpc_int.h
drivers/staging/cpc-usb/cpcusb.h

David S. Miller
2009-09-25 06:13:11 +0800
a6b49cb21 Merge branch 'for-linus' of git://git.monstr.eu/linux-2.6-microblaze

* 'for-linus' of git://git.monstr.eu/linux-2.6-microblaze: (24 commits)
microblaze: Disable heartbeat/enable emaclite in defconfigs
microblaze: Support simpleImage.dts make target
microblaze: Fix _start symbol to physical address
microblaze: Use LOAD_OFFSET macro to get correct LMA for all sections
microblaze: Create the LOAD_OFFSET macro used to compute VMA vs LMA offsets
microblaze: Copy ppc asm-compat.h for clean handling of constants in asm and C
microblaze: Actually show KiB rather than pages in "Freeing initrd memory:"
microblaze: Support ptrace syscall tracing.
microblaze: Updated CPU version and FPGA family codes in PVR
microblaze: Generate correct signal and siginfo for integer div-by-zero
microblaze: Don't be noisy when userspace causes hardware exceptions
microblaze: Remove ipc.h file which points to non-existing asm-generic file
microblaze: Clear sticky FSR register after generating exception signals
microblaze: Ensure CPU usermode is set on new userspace processes
microblaze: Use correct kbuild variable KBUILD_CFLAGS
microblaze: Save and restore msr in hw exception
microblaze: Add architectural support for USB EHCI host controllers
microblaze: Implement include/asm/syscall.h.
microblaze: Improve checking mechanism for MSR instruction
microblaze: Add checking mechanism for MSR instruction
...

Linus Torvalds
2009-09-25 00:01:44 +0800
2c9871de0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus

* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
module: don't call percpu_modfree on NULL pointer.
module: fix memory leak when load fails after srcversion/version allocated
module: preferred way to use MODULE_AUTHOR
param: allow whitespace as kernel parameter separator
module: reduce string table for loaded modules (v2)
module: reduce symbol table for loaded modules (v2)

Linus Torvalds
2009-09-25 00:01:05 +0800

24 Sep, 2009

37 commits

6d39b27f0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current

* git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
lsm: Use a compressed IPv6 string format in audit events
Audit: send signal info if selinux is disabled
Audit: rearrange audit_context to save 16 bytes per struct
Audit: reorganize struct audit_watch to save 8 bytes

Linus Torvalds
2009-09-24 23:31:04 +0800
ffa9f12a4 module: don't call percpu_modfree on NULL pointer.

The general one handles NULL, the static obsolescent
(CONFIG_HAVE_LEGACY_PER_CPU_AREA) one in module.c doesn't; Eric's
commit 720eba31 assumed it did, and various frobbings since then kept
that assumption.

All other callers in module.c protect it with an if; this effectively
does the same, as free_init is only reached if we fail percpu_modalloc().

Reported-by: Kamalesh Babulal
Signed-off-by: Rusty Russell
Cc: Eric Dumazet
Cc: Masami Hiramatsu
Cc: Américo Wang
Tested-by: Kamalesh Babulal

Rusty Russell
2009-09-24 23:02:59 +0800
a263f7763 module: fix memory leak when load fails after srcversion/version allocated

Normally the twisty paths of sysfs will free the attributes, but not if
we fail before we hook it into sysfs (which is the last thing we do in
load_module).

(This sysfs code is a turd, no doubt there are other issues lurking too).

Reported-by: Tetsuo Handa
Signed-off-by: Rusty Russell
Cc: Catalin Marinas
Tested-by: Tetsuo Handa

Rusty Russell
2009-09-24 23:02:59 +0800
26d052bfc param: allow whitespace as kernel parameter separator

Some boot mechanisms require that kernel parameters are stored in a
separate file which is loaded to memory without further processing
(e.g. the "Load from FTP" method on s390). When such a file contains
newline characters, the kernel parameter preceding the newline might
not be correctly parsed (due to the newline being stuck to the end of
the actual parameter value) which can lead to boot failures.

This patch improves kernel command line usability in such a situation
by allowing generic whitespace characters as separators between kernel
parameters.
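The idea can be illustrated with a small userspace sketch. This is not the kernel's parser: `split_params` is a hypothetical helper, and quoting is ignored to keep it short. It only shows that treating any whitespace character (including a trailing newline) as a separator keeps the newline from sticking to the preceding value.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the kernel's parser: split `line` in place
 * on any whitespace (space, tab, newline, ...) and store up to `max`
 * parameter pointers in `out`. Returns the number of parameters found.
 * Quoting is deliberately ignored to keep the sketch short.
 */
static size_t split_params(char *line, char **out, size_t max)
{
	size_t n = 0;
	char *p = line;

	while (*p && n < max) {
		/* Skip any run of whitespace between parameters. */
		while (*p && isspace((unsigned char)*p))
			p++;
		if (!*p)
			break;
		out[n++] = p;
		/* Advance to the end of this parameter. */
		while (*p && !isspace((unsigned char)*p))
			p++;
		/* Terminate the parameter so no separator sticks to it. */
		if (*p)
			*p++ = '\0';
	}
	return n;
}
```

Given a buffer such as "root=/dev/ram0\nro", the helper would yield the two parameters "root=/dev/ram0" and "ro", with no newline attached to either.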

Signed-off-by: Peter Oberparleiter
Signed-off-by: Rusty Russell

Peter Oberparleiter
2009-09-24 23:02:58 +0800
554bdfe5a module: reduce string table for loaded modules (v2)

Also remove all parts of the string table (referenced by the symbol
table) that are not needed for kallsyms use (i.e. which were only
referenced by symbols discarded by the previous patch, or not
referenced at all for whatever reason).

Signed-off-by: Jan Beulich
Signed-off-by: Rusty Russell

Jan Beulich
2009-09-24 23:02:57 +0800
4a4962263 module: reduce symbol table for loaded modules (v2)

Discard all symbols not interesting for kallsyms use: absolute,
section, and in the common case (!KALLSYMS_ALL) data ones.

Signed-off-by: Jan Beulich
Signed-off-by: Rusty Russell

Jan Beulich
2009-09-24 23:02:57 +0800
db1682636 Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6

* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitrary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...

Linus Torvalds
2009-09-24 22:53:22 +0800
801460d0c task_struct cleanup: move binfmt field to mm_struct

Because the binfmt does not differ between threads in the same process,
it can be moved from task_struct to mm_struct. The binfmt module is then
handled per mm_struct instead of per task_struct.

Signed-off-by: Hiroshi Shimamoto
Acked-by: Oleg Nesterov
Cc: Rusty Russell
Acked-by: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hiroshi Shimamoto
2009-09-24 22:21:05 +0800
858f09930 aio: ifdef fields in mm_struct

->ioctx_lock and ->ioctx_list are used only under CONFIG_AIO.

Signed-off-by: Alexey Dobriyan
Cc: Zach Brown
Cc: Benjamin LaHaise
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2009-09-24 22:21:05 +0800
e5a473869 pidns: deny CLONE_PARENT|CLONE_NEWPID combination

CLONE_PARENT was used to implement an older threading model. For
consistency with the CLONE_THREAD check in copy_pid_ns(), disable
CLONE_PARENT with CLONE_NEWPID, at least until the required semantics of
pid namespaces are clear.

Signed-off-by: Sukadev Bhattiprolu
Acked-by: Roland McGrath
Acked-by: Serge Hallyn
Cc: Oren Laadan
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sukadev Bhattiprolu
2009-09-24 22:21:04 +0800
123be07b0 fork(): disable CLONE_PARENT for init

When global or container-init processes use CLONE_PARENT, they create a
multi-rooted process tree. Besides, siblings of global init remain as
zombies on exit since they are not reaped by their parent (swapper). So
prevent global and container-inits from creating siblings.

Signed-off-by: Sukadev Bhattiprolu
Acked-by: Eric W. Biederman
Acked-by: Roland McGrath
Cc: Oren Laadan
Cc: Oleg Nesterov
Cc: Serge Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sukadev Bhattiprolu
2009-09-24 22:21:04 +0800
8d65af789 sysctl: remove "struct file *" argument of ->proc_handler

It's unused.

It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.

It _was_ used in two places at arch/frv for some reason.

Signed-off-by: Alexey Dobriyan
Cc: David Howells
Cc: "Eric W. Biederman"
Cc: Al Viro
Cc: Ralf Baechle
Cc: Martin Schwidefsky
Cc: Ingo Molnar
Cc: "David S. Miller"
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2009-09-24 22:21:04 +0800
d9588725e signals: inline __fatal_signal_pending

__fatal_signal_pending inlines to one instruction on x86, probably two
instructions on other machines. It takes two longer x86 instructions just
to call it and test its return value, not to mention the function itself.

On my random x86_64 config, this saved 70 bytes of text (59 of those being
__fatal_signal_pending itself).

Signed-off-by: Roland McGrath
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Roland McGrath
2009-09-24 22:21:01 +0800
4a30debfb signals: introduce do_send_sig_info() helper

Introduce do_send_sig_info() and convert group_send_sig_info(),
send_sig_info(), do_send_specific() to use this helper.

Hopefully it will have more users soon; it allows specifying
specific/group behaviour via the "bool group" argument.

Shaves 80 bytes from .text.

Signed-off-by: Oleg Nesterov
Cc: Peter Zijlstra
Cc: stephane eranian
Cc: Ingo Molnar
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:21:01 +0800
a293980c2 exec: let do_coredump() limit the number of concurrent dumps to pipes

Introduce core pipe limiting sysctl.

Since we can dump cores to pipe, rather than directly to the filesystem,
we create a condition in which a user can create a very high load on the
system simply by running bad applications.

If the pipe reader specified in core_pattern is poorly written, we can
have lots of outstanding resources and processes in the system.

This sysctl introduces an ability to limit that resource consumption.
core_pipe_limit defines how many in-flight dumps may be run in parallel;
dumps beyond this value are skipped and a note is made in the kernel log.
A special value of 0 in core_pipe_limit denotes unlimited core dumps may
be handled (this is the default value).
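The limiting rule described above can be sketched in userspace C as a reserve-then-check counter. This is only an illustration under assumptions: `core_pipe_limit` is the sysctl named in the message, while `core_dump_count`, `core_pipe_dump_begin()` and `core_pipe_dump_end()` are hypothetical names, and the kernel log note is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the sysctl value and the in-flight counter. */
static unsigned int core_pipe_limit;	/* 0 means "unlimited" */
static atomic_uint core_dump_count;	/* dumps currently in flight */

/*
 * Try to start a piped dump. Returns true if the dump may proceed,
 * false if it should be skipped because the limit is already reached.
 */
static bool core_pipe_dump_begin(void)
{
	/* Reserve a slot first, then check it against the limit. */
	unsigned int in_flight = atomic_fetch_add(&core_dump_count, 1) + 1;

	if (core_pipe_limit != 0 && in_flight > core_pipe_limit) {
		/* Over the limit: release the slot we just reserved. */
		atomic_fetch_sub(&core_dump_count, 1);
		return false;
	}
	return true;
}

/* Call once for every dump that core_pipe_dump_begin() admitted. */
static void core_pipe_dump_end(void)
{
	atomic_fetch_sub(&core_dump_count, 1);
}
```

Reserving before checking keeps the test race-free without a lock: two concurrent callers can never both observe a free last slot.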

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Neil Horman
Reported-by: Earl Chew
Cc: Oleg Nesterov
Cc: Andi Kleen
Cc: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Neil Horman
2009-09-24 22:21:00 +0800
ae6d2ed7b signals: tracehook_notify_jctl change

This changes tracehook_notify_jctl() so it's called with the siglock held,
and changes its argument and return value definition. These clean-ups
make it a better fit for what new tracing hooks need to check.

Tracing needs the siglock here, held from the time TASK_STOPPED was set,
to avoid potential SIGCONT races if it wants to allow any blocking in its
tracing hooks.

This also folds the finish_stop() function into its caller
do_signal_stop(). The function is short, called only once and only
unconditionally. It aids readability to fold it in.

[oleg@redhat.com: do not call tracehook_notify_jctl() in TASK_STOPPED state]
[oleg@redhat.com: introduce tracehook_finish_jctl() helper]
Signed-off-by: Roland McGrath
Signed-off-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Roland McGrath
2009-09-24 22:21:00 +0800
b6fe2d117 wait_noreap_copyout(): check for ->wo_info != NULL

Current behaviour of sys_waitid() looks odd. If user passes infop ==
NULL, sys_waitid() returns success. When user additionally specifies flag
WNOWAIT, sys_waitid() returns -EFAULT on the same conditions. When user
combines WNOWAIT with WCONTINUED, sys_waitid() again returns success.

This patch adds a check for ->wo_info in wait_noreap_copyout().

User-visible change: starting from this commit, sys_waitid() always checks
infop != NULL and does not fail if it is NULL.

Signed-off-by: Vitaly Mayatskikh
Reviewed-by: Oleg Nesterov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vitaly Mayatskikh
2009-09-24 22:21:00 +0800
dfe16dfa4 do_wait: fix sys_waitid()-specific behaviour

do_wait() checks ->wo_info to figure out who is the caller. If it's not
NULL the caller should be sys_waitid(), in that case do_wait() fixes up
the retval or zeros ->wo_info, depending on retval from underlying
function.

This is a bug: the user can pass ->wo_info == NULL and sys_waitid() will
return an incorrect value.

man 2 waitid says:

waitid(): returns 0 on success

Test-case:

#include <assert.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	if (fork())
		assert(waitid(P_ALL, 0, NULL, WEXITED) == 0);

	return 0;
}

Result:

Assertion `waitid(P_ALL, 0, ((void *)0), 4) == 0' failed.

Move that code to sys_waitid().

User-visible change: sys_waitid() will return 0 on success, whether
infop is set or not.

Note, there's another bug in wait_noreap_copyout() which affects the
return value of sys_waitid(). It will be fixed in the next patch.

Signed-off-by: Vitaly Mayatskikh
Reviewed-by: Oleg Nesterov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vitaly Mayatskikh
2009-09-24 22:21:00 +0800
b6e763f07 wait_consider_task: kill "parent" argument

Kill the "parent" argument of wait_consider_task(); it was never used.

Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Cc: Ingo Molnar
Cc: Ratan Nalumasu
Cc: Vitaly Mayatskikh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:21:00 +0800
989264f46 do_wait-wakeup-optimization: simplify task_pid_type()

task_pid_type() is only used by eligible_pid() which has to check wo_type
!= PIDTYPE_MAX anyway. Remove this check from task_pid_type() and factor
out ->pids[type] access, this shrinks .text a bit and simplifies the code.

This matches the behaviour of other similar helpers, say get_task_pid().
The caller must ensure that pid_type is valid, not the callee.

Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:21:00 +0800
5c01ba49e do_wait-wakeup-optimization: fix child_wait_callback()->eligible_child() usage

child_wait_callback()->eligible_child() is not right, we can miss the
wakeup if the task was detached before __wake_up_parent() and the caller
of do_wait() didn't use __WALL.

Move ->wo_pid checks from eligible_child() to the new helper,
eligible_pid(), and change child_wait_callback() to use it instead of
eligible_child().

Note: actually I think it would be better to fix the __WCLONE check in
eligible_child(), it doesn't look exactly right. But it is not clear what
is the supposed behaviour, and any change is user-visible.

Reported-by: KAMEZAWA Hiroyuki
Tested-by: KAMEZAWA Hiroyuki
Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:21:00 +0800
b4fe51823 do_wait() wakeup optimization: child_wait_callback: check __WNOTHREAD case

Suggested by Roland.

do_wait(__WNOTHREAD) can only succeed if the caller is either ptracer, or
it is ->real_parent and the child is not traced. IOW, caller == p->parent
otherwise we should not wake up.

Change child_wait_callback() to check this. Ratan reports a workload with
CPU load >99% caused by unnecessary wakeups; this patch should fix it.

Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
Cc: Ingo Molnar
Cc: Ratan Nalumasu
Cc: Vitaly Mayatskikh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:21:00 +0800
0b7570e77 do_wait() wakeup optimization: change __wake_up_parent() to use filtered wakeup

Ratan Nalumasu reported that a process with many threads does
unnecessary wakeups. Every waiting thread in the process wakes up to loop
through the children and see that the only ones it cares about are still
not ready.

Now that we have struct wait_opts we can change do_wait/__wake_up_parent
to use filtered wakeups.

We can make child_wait_callback() more clever later, right now it only
checks eligible_child().

Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
Cc: Ingo Molnar
Cc: Ratan Nalumasu
Cc: Vitaly Mayatskikh
Acked-by: James Morris
Tested-by: Valdis Kletnieks
Acked-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:20:59 +0800
a2322e1d2 do_wait() wakeup optimization: shift security_task_wait() from eligible_child() to wait_consider_task()

Preparation, no functional changes.

eligible_child() has a single caller, wait_consider_task(). We can move
security_task_wait() out from eligible_child(), this allows us to use it
for filtered wake_up().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Oleg Nesterov
2009-09-24 22:20:59 +0800
a7f0765ed ptrace: __ptrace_detach: do __wake_up_parent() if we reap the tracee

The bug is old; it wasn't caused by recent changes.

Test case:

#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

static void *tfunc(void *arg)
{
	int pid = (long)arg;

	assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0);
	kill(pid, SIGKILL);

	sleep(1);
	return NULL;
}

int main(void)
{
	pthread_t th;
	long pid = fork();

	if (!pid)
		pause();

	signal(SIGCHLD, SIG_IGN);
	assert(pthread_create(&th, NULL, tfunc, (void*)pid) == 0);

	int r = waitpid(-1, NULL, __WNOTHREAD);
	printf("waitpid: %d %m\n", r);

	return 0;
}

Before the patch this program hangs; after this patch waitpid() correctly
fails with errno == ECHILD.

The problem is, __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its
->real_parent is our sub-thread and we ignore SIGCHLD. But in this case
we should wake up other threads which can sleep in do_wait().

Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Cc: Vitaly Mayatskikh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:20:59 +0800
f64c3f549 memory controller: soft limit organize cgroups

Organize cgroups over soft limit in a RB-Tree

Introduce an RB-Tree for storing memory cgroups that are over their soft
limit. The overall goal is to:

1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
We are careful about updates; updates take place only after a particular
time interval has passed.
2. Remove the node from the RB-Tree when the usage goes below the soft
limit.

The next set of patches will exploit the RB-Tree to get the group that is
over its soft limit by the largest amount and reclaim from it, when we
face memory contention.
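The selection rule in that last paragraph (pick the group over its soft limit by the largest amount) can be sketched as follows. This is a simplified illustration, not the kernel's code: `struct group_usage` and both function names are hypothetical, and a plain linear scan stands in for the RB-Tree, which exists precisely to avoid such a scan.

```c
#include <stddef.h>

/* Hypothetical, simplified view of one memory cgroup's accounting. */
struct group_usage {
	unsigned long usage;		/* current usage, in bytes */
	unsigned long soft_limit;	/* soft limit, in bytes */
};

/* How far a group is over its soft limit; 0 if it is within the limit. */
static unsigned long soft_limit_excess(const struct group_usage *g)
{
	return g->usage > g->soft_limit ? g->usage - g->soft_limit : 0;
}

/*
 * Return the index of the group with the largest excess, or -1 if no
 * group is over its soft limit. A linear scan stands in for the RB-Tree.
 */
static int pick_reclaim_victim(const struct group_usage *groups, size_t n)
{
	int victim = -1;
	unsigned long worst = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long excess = soft_limit_excess(&groups[i]);

		if (excess > worst) {
			worst = excess;
			victim = (int)i;
		}
	}
	return victim;
}
```

Keying a tree on the excess value would make the largest offender the rightmost node, so the lookup no longer needs to visit every group.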

[hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot]
Signed-off-by: Balbir Singh
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: KOSAKI Motohiro
Signed-off-by: Hugh Dickins
Cc: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Balbir Singh
2009-09-24 22:20:59 +0800
296c81d89 memory controller: soft limit interface

Add an interface to allow get/set of soft limits. Soft limits for the
memory plus swap controller (memsw) are currently not supported. Resource
counters have been enhanced to support soft limits and a new type,
RES_SOFT_LIMIT, has been added. Unlike hard limits, soft limits can be
directly set and do not need any reclaim or checks before setting them to
a newer value.

Kamezawa-San raised a question as to whether soft limit should belong to
res_counter. Since all resources understand the basic concepts of hard
and soft limits, it is justified to add soft limits here. Soft limits are
a generic resource usage feature, even file system quotas support soft
limits.

Signed-off-by: Balbir Singh
Cc: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Balbir Singh
2009-09-24 22:20:59 +0800
be367d099 cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time

Alter the ss->can_attach and ss->attach functions to be able to deal with
a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
pre-patch to cgroup-procs-writable.patch.)

Currently, the new mode of the attach function can only tell the subsystem
about the old cgroup of the threadgroup leader. No subsystem currently
needs that information for each thread that's being moved, but if one were
to be added (for example, one that counts tasks within a group) this part
would need to be reworked a bit to tell the subsystem the right
information.

[hidave.darkstar@gmail.com: fix build]
Signed-off-by: Ben Blum
Signed-off-by: Paul Menage
Acked-by: Li Zefan
Reviewed-by: Matt Helsley
Cc: "Eric W. Biederman"
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Dave Young
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ben Blum
2009-09-24 22:20:58 +0800
c378369d8 cgroups: change css_set freeing mechanism to be under RCU

Changes css_set freeing mechanism to be under RCU

This is a prepatch for making the procs file writable. In order to free the
old css_sets for each task to be moved as they're being moved, the freeing
mechanism must be RCU-protected, or else we would have to have a call to
synchronize_rcu() for each task before freeing its old css_set.

Signed-off-by: Ben Blum
Signed-off-by: Paul Menage
Cc: "Paul E. McKenney"
Acked-by: Li Zefan
Cc: Matt Helsley
Cc: "Eric W. Biederman"
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ben Blum
2009-09-24 22:20:58 +0800
d1d9fd330 cgroups: use vmalloc for large cgroups pidlist allocations

Moves all pidlist allocation requests into a separate function that
decides, based on the requested size, whether the array needs to be
vmalloced or can be obtained via kmalloc; likewise for kfree/vfree.

Signed-off-by: Ben Blum
Signed-off-by: Paul Menage
Acked-by: Li Zefan
Cc: Matt Helsley
Cc: "Eric W. Biederman"
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ben Blum
2009-09-24 22:20:58 +0800
72a8cb30d cgroups: ensure correct concurrent opening/reading of pidlists across pid namespaces

Previously there was a problem in which two processes from different pid
namespaces reading the tasks or procs file could result in one process
seeing results from the other's namespace. Rather than one pidlist for
each file in a cgroup, we now keep a list of pidlists keyed by namespace
and file type (tasks versus procs) in which entries are placed on demand.
Each pidlist has its own lock, and because the pidlists themselves are
passed around in the seq_file's private pointer, we don't have to touch the
cgroup or its master list except when creating and destroying entries.

Signed-off-by: Ben Blum
Signed-off-by: Paul Menage
Cc: Li Zefan
Cc: Matt Helsley
Cc: "Eric W. Biederman"
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ben Blum
2009-09-24 22:20:58 +0800
102a775e3 cgroups: add a read-only "procs" file similar to "tasks" that shows only unique tgids

struct cgroup used to have a bunch of fields for keeping track of the
pidlist for the tasks file. Those are now separated into a new struct
cgroup_pidlist, of which there are two, one for procs and one for tasks.
The way the seq_file operations are set up is changed so that just the
pidlist struct gets passed around as the private data.

Interface example: Suppose a multithreaded process has pid 1000 and other
threads with ids 1001, 1002, 1003:
$ cat tasks
1000
1001
1002
1003
$ cat cgroup.procs
1000
$
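One plausible way to produce the "unique tgids" view shown above is to collect the tgid of every thread, sort the array, and drop duplicates in place. The sketch below is a userspace illustration only; `pidlist_uniq` and `cmp_pid` are hypothetical names here, and the message itself does not say how the kernel does it.

```c
#include <stdlib.h>
#include <sys/types.h>

/* qsort() comparator for pid_t values. */
static int cmp_pid(const void *a, const void *b)
{
	pid_t x = *(const pid_t *)a, y = *(const pid_t *)b;

	return (x > y) - (x < y);
}

/*
 * Hypothetical helper: sort `list` and drop duplicates in place.
 * Returns the number of unique entries left at the front of the array.
 */
static size_t pidlist_uniq(pid_t *list, size_t len)
{
	size_t src, dst;

	if (len == 0)
		return 0;
	qsort(list, len, sizeof(*list), cmp_pid);
	/* After sorting, duplicates are adjacent; keep the first of each run. */
	for (src = 1, dst = 1; src < len; src++) {
		if (list[src] != list[dst - 1])
			list[dst++] = list[src];
	}
	return dst;
}
```

For the example above, the four threads all report tgid 1000, so the deduplicated list has the single entry that `cat cgroup.procs` prints.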

Signed-off-by: Ben Blum
Signed-off-by: Paul Menage
Acked-by: Li Zefan
Cc: Matt Helsley
Cc: "Eric W. Biederman"
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ben Blum
2009-09-24 22:20:58 +0800
8f3ff2086 cgroups: revert "cgroups: fix pid namespace bug"

The following series adds a "cgroup.procs" file to each cgroup that
reports unique tgids rather than pids, and allows all threads in a
threadgroup to be atomically moved to a new cgroup.

The subsystem "attach" interface is modified to support attaching whole
threadgroups at a time, which could introduce potential problems if any
subsystem were to need to access the old cgroup of every thread being
moved. The attach interface may need to be revised if this becomes the
case.

Also added is functionality for read/write locking all CLONE_THREAD
fork()ing within a threadgroup, by means of an rwsem that lives in the
sighand_struct, for per-threadgroup-ness and also for sharing a cacheline
with the sighand's atomic count. This scheme should introduce no extra
overhead in the fork path when there's no contention.

The final patch reveals potential for a race when forking before a
subsystem's attach function is called - one potential solution in case any
subsystem has this problem is to hang on to the group's fork mutex through
the attach() calls, though no subsystem yet demonstrates need for an
extended critical section.

This patch:

Revert

commit 096b7fe012d66ed55e98bc8022405ede0cc80e96
Author: Li Zefan
AuthorDate: Wed Jul 29 15:04:04 2009 -0700
Commit: Linus Torvalds
CommitDate: Wed Jul 29 19:10:35 2009 -0700

cgroups: fix pid namespace bug

This is in preparation for some clashing cgroups changes that subsume the
original commit's functionality.

The original commit fixed a pid namespace bug which Ben Blum fixed
independently (in the same way, but with different code) as part of a
series of patches. I played around with trying to reconcile Ben's patch
series with Li's patch, but concluded that it was simpler to just revert
Li's, given that Ben's patch series contained essentially the same fix.

Signed-off-by: Paul Menage
Cc: Li Zefan
Cc: Matt Helsley
Cc: "Eric W. Biederman"
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Menage
2009-09-24 22:20:58 +0800
2c6ab6d20 cgroups: allow cgroup hierarchies to be created with no bound subsystems

This patch removes the restriction that a cgroup hierarchy must have at
least one bound subsystem. The mount option "none" is treated as an
explicit request for no bound subsystems.

A hierarchy with no subsystems can be useful for plain task tracking, and
is also a step towards the support for multiply-bindable subsystems.

As part of this change, the hierarchy id is no longer calculated from the
bitmask of subsystems in the hierarchy (since this is not guaranteed to be
unique) but is allocated via an ida. Reference counts on cgroups from
css_set objects are now taken explicitly one per hierarchy, rather than
one per subsystem.

Example usage:

mount -t cgroup -o none,name=foo cgroup /mnt/cgroup

Based on the "no-op"/"none" subsystem concept proposed by
kamezawa.hiroyu@jp.fujitsu.com

Signed-off-by: Paul Menage
Reviewed-by: Li Zefan
Cc: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Dhaval Giani
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Menage
2009-09-24 22:20:58 +0800
7717f7ba9 cgroups: add a back-pointer from struct cg_cgroup_link to struct cgroup

Currently the cgroups code makes the assumption that the subsystem
pointers in a struct css_set uniquely identify the hierarchy->cgroup
mappings associated with the css_set; and there's no way to directly
identify the associated set of cgroups other than by indirecting through
the appropriate subsystem state pointers.

This patch removes the need for that assumption by adding a back-pointer
from struct cg_cgroup_link object to its associated cgroup; this allows
the set of cgroups to be determined by traversing the cg_links list in
the struct css_set.

Signed-off-by: Paul Menage
Reviewed-by: Li Zefan
Cc: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Dhaval Giani
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Menage
2009-09-24 22:20:58 +0800
fe6934354 cgroups: move the cgroup debug subsys into cgroup.c to access internal state

While it's architecturally clean to have the cgroup debug subsystem be
completely independent of the cgroups framework, it limits its usefulness
for debugging the contents of internal data structures. Move the debug
subsystem code into the scope of all the cgroups data structures to make
more detailed debugging possible.

Signed-off-by: Paul Menage
Reviewed-by: Li Zefan
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Dhaval Giani
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Menage
2009-09-24 22:20:57 +0800
c6d57f331 cgroups: support named cgroups hierarchies

To simplify referring to cgroup hierarchies in mount statements, and to
allow disambiguation in the presence of empty hierarchies and
multiply-bindable subsystems, this patch adds support for naming a new
cgroup hierarchy via the "name=" mount option.

A pre-existing hierarchy may be specified by either name or by subsystems;
a hierarchy's name cannot be changed by a remount operation.

Example usage:

# To create a hierarchy called "foo" containing the "cpu" subsystem
mount -t cgroup -oname=foo,cpu cgroup /mnt/cgroup1

# To mount the "foo" hierarchy on a second location
mount -t cgroup -oname=foo cgroup /mnt/cgroup2

Signed-off-by: Paul Menage
Reviewed-by: Li Zefan
Cc: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Dhaval Giani
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Menage
2009-09-24 22:20:57 +0800