17 May, 2008

1 commit


02 May, 2008

1 commit


30 Apr, 2008

1 commit

  • Suggested by Roland McGrath.

    Initialize signal->curr_target in copy_signal(). This way ->curr_target is
    never NULL, so we can kill the check in __group_complete_signal()'s hot path.
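    A hedged sketch of the idea (not the exact diff; names follow kernel/fork.c,
    error handling and unrelated initialization omitted):

    static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
    {
        struct signal_struct *sig;

        sig = kmem_cache_alloc(signal_cachep, GFP_KERNEL);
        if (!sig)
            return -ENOMEM;
        /* The forking task is a valid first wake-up target, so
         * ->curr_target never has to be NULL again. */
        sig->curr_target = tsk;
        /* ... remaining fields initialized as before ... */
        tsk->signal = sig;
        return 0;
    }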

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

29 Apr, 2008

4 commits

  • The kernel implements readlink of /proc/pid/exe by getting the file from
    the first executable VMA. Then the path to the file is reconstructed and
    reported as the result.

    Because of the VMA walk the code is slightly different on nommu systems.
    This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
    walking the VMAs to find the first executable file-backed VMA we store a
    reference to the exec'd file in the mm_struct.

    That reference would prevent the filesystem holding the executable file
    from being unmounted even after unmapping the VMAs. So we track the number
    of VM_EXECUTABLE VMAs and drop the new reference when the last one is
    unmapped. This avoids pinning the mounted filesystem.
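    For reference, the user-visible interface being preserved is just a magic
    symlink; a minimal, self-contained reader (illustrative userspace C, not
    part of the patch):

    #include <stdio.h>
    #include <unistd.h>
    #include <limits.h>

    int main(void)
    {
        char path[PATH_MAX];
        ssize_t n = readlink("/proc/self/exe", path, sizeof(path) - 1);

        if (n < 0) {
            perror("readlink");
            return 1;
        }
        path[n] = '\0';    /* readlink() does not NUL-terminate */
        printf("running from: %s\n", path);
        return 0;
    }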

    [akpm@linux-foundation.org: improve comments]
    [yamamoto@valinux.co.jp: fix dup_mmap]
    Signed-off-by: Matt Helsley
    Cc: Oleg Nesterov
    Cc: David Howells
    Cc:"Eric W. Biederman"
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Hugh Dickins
    Signed-off-by: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly; this can
    cause kernel memory corruption. CLONE_NEWIPC must detach from the existing
    undo lists.

    Fix, part 2: perform an implicit CLONE_SYSVSEM in CLONE_NEWIPC. Since
    CLONE_NEWIPC creates a new IPC namespace, the task cannot access the
    existing semaphore arrays after the unshare syscall; thus it can, and must,
    detach from the existing undo list entries, too.

    This fixes the kernel corruption because it makes it impossible for
    undo records from two different namespaces to end up on the same
    sysvsem.undo_list.
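    A minimal illustration of the user-visible behaviour (hedged demo, not
    from the patch; needs CAP_SYS_ADMIN and a kernel with IPC namespaces):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        if (unshare(CLONE_NEWIPC) != 0) {
            perror("unshare(CLONE_NEWIPC)");
            return 1;
        }
        /* With this fix, the implicit CLONE_SYSVSEM has also detached
         * us from any SEM_UNDO entries of the old namespace. */
        puts("in a fresh IPC namespace");
        return 0;
    }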

    Signed-off-by: Manfred Spraul
    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Michael Kerrisk
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly; this can
    cause kernel memory corruption. CLONE_NEWIPC must detach from the existing
    undo lists.

    Fix, part 1: add support for sys_unshare(CLONE_SYSVSEM)

    The original reason to not support it was the potential (inevitable?)
    confusion due to the fact that sys_unshare(CLONE_SYSVSEM) has the
    inverse meaning of clone(CLONE_SYSVSEM).

    Our two most reasonable options then appear to be (1) fully support
    CLONE_SYSVSEM, or (2) continue to refuse explicit CLONE_SYSVSEM,
    but always do it anyway on unshare(CLONE_SYSVSEM). This patch does
    (1).
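    As a rough usage sketch (hedged, illustrative demo, not part of the patch),
    a task holding a SEM_UNDO adjustment can now detach from its undo list
    explicitly:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    int main(void)
    {
        int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        struct sembuf up = { .sem_num = 0, .sem_op = 1, .sem_flg = SEM_UNDO };

        if (id < 0 || semop(id, &up, 1) != 0) {
            perror("semget/semop");
            return 1;
        }
        /* Previously refused by unshare_semundo(); now drops our
         * undo list entries. */
        if (unshare(CLONE_SYSVSEM) != 0)
            perror("unshare(CLONE_SYSVSEM)");
        semctl(id, 0, IPC_RMID);    /* clean up the test semaphore */
        return 0;
    }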

    Changelog:
    Apr 16: SEH: switch to Manfred's alternative patch, which
    removes the unshare_semundo() function that
    always refused CLONE_SYSVSEM.

    Signed-off-by: Manfred Spraul
    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Michael Kerrisk
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Remove the mem_cgroup member from mm_struct and instead add an owner.

    This approach was suggested by Paul Menage. Its advantage is that, once
    mm->owner is known, the cgroup can be determined from it using the
    subsystem id. It also allows several control groups that are virtually
    grouped by mm_struct to exist independent of the memory controller, i.e.,
    without adding a mem_cgroup-style pointer to mm_struct for each
    controller.

    A new config option CONFIG_MM_OWNER is added and the memory resource
    controller selects this config option.

    This patch also adds cgroup callbacks to notify subsystems when mm->owner
    changes. The mm_cgroup_changed callback is called with the task_lock() of
    the new task held and is called just prior to changing the mm->owner.

    I am indebted to Paul Menage for the several reviews of this patchset and
    helping me make it lighter and simpler.

    This patch was tested on a powerpc box; it was compiled with the
    MM_OWNER config both turned on and off.

    After the thread group leader exits, it is moved to init_css_set by
    cgroup_exit(); thus all future charges from running threads would be
    redirected to the init_css_set's subsystem.
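    A toy model of the indirection (plain userspace C with invented names,
    only to illustrate the pointer chase the patch enables):

    #include <stdio.h>

    struct cgroup { const char *name; };
    struct task  { struct cgroup *css[2]; };  /* one slot per subsystem id */
    struct mm    { struct task *owner; };     /* replaces mm->mem_cgroup   */

    static struct cgroup *mm_cgroup(struct mm *mm, int subsys_id)
    {
        /* Any controller can be reached via the owner task. */
        return mm->owner->css[subsys_id];
    }

    int main(void)
    {
        struct cgroup mem = { "memctl" };
        struct task leader = { { &mem, NULL } };
        struct mm mm = { &leader };

        printf("mm belongs to %s\n", mm_cgroup(&mm, 0)->name);
        return 0;
    }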

    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Hugh Dickins
    Cc: Sudhir Kumar
    Cc: YAMAMOTO Takashi
    Cc: Hirokazu Takahashi
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Pekka Enberg
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

28 Apr, 2008

2 commits

  • This patch renames mpol_copy() to mpol_dup() because, well, that's what it
    does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
    existing mempolicy, allocates a new one and copies the contents.

    In a later patch, I want to use the name mpol_copy() to copy the contents from
    one mempolicy to another like, e.g., strcpy() does for strings.
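    The naming analogy in ordinary string terms (illustrative only):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        char dst[16];
        char *dup = strdup("mempolicy");    /* like mpol_dup(): allocate + copy */

        strcpy(dst, "mempolicy");           /* like the planned mpol_copy() */
        printf("%s / %s\n", dup, dst);
        free(dup);
        return 0;
    }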

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This is a change that was requested some time ago by Mel Gorman. Makes sense
    to me, so here it is.

    Note: I retain the name "mpol_free_shared_policy()" because it actually does
    free the shared_policy, which is NOT a reference counted object. However, ...

    The mempolicy object[s] referenced by the shared_policy are reference counted,
    so mpol_put() is used to release the reference held by the shared_policy. The
    mempolicy might not be freed at this time, because some task attached to the
    shared object associated with the shared policy may be in the process of
    allocating a page based on the mempolicy. In that case, the task performing
    the allocation will hold a reference on the mempolicy, obtained via
    mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
    holding such a reference have called mpol_put() for the mempolicy.
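    The put-side convention, as a self-contained sketch (simplified: the real
    kernel uses atomic_t and a slab cache):

    #include <stdlib.h>

    struct mempolicy_sketch {
        int refcnt;    /* one count per shared_policy or allocating task */
    };

    void mpol_put_sketch(struct mempolicy_sketch *pol)
    {
        /* Only the last holder frees; a task mid-allocation that got its
         * reference via mpol_shared_policy_lookup() keeps the object alive. */
        if (--pol->refcnt == 0)
            free(pol);
    }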

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

27 Apr, 2008

2 commits

  • The SIE instruction on s390 uses the 2nd half of the page table page to
    virtualize the storage keys of a guest. This patch offers the s390_enable_sie
    function, which reorganizes the page tables of a single-threaded process to
    reserve space in the page table:
    s390_enable_sie makes sure that the process is single-threaded and then uses
    dup_mm to create a new mm with reorganized page tables. The old mm is freed
    and the process now has a page status extended field after every page table.

    Code that wants to exploit pgstes should select CONFIG_PGSTE.

    This patch has a small common code hit, namely making dup_mm non-static.

    Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's
    review feedback. We now do have the prototype for dup_mm in
    include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() now
    calls task_lock() to prevent a race against ptrace modification of mm_users.

    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Carsten Otte
    Acked-by: Andrew Morton
    Signed-off-by: Avi Kivity

    Carsten Otte
     
  • Arrgghhh...

    Sorry about that, I'd been sure I'd folded that one, but it actually got
    lost. Please apply - that breaks execve().

    Signed-off-by: Al Viro
    Tested-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Al Viro
     

25 Apr, 2008

3 commits

  • * let unshare_files() give the caller the displaced files_struct
    * don't bother with grabbing a reference only to drop it in the
    caller if it hadn't been shared in the first place
    * in that form unshare_files() is trivially implemented via
    unshare_fd(), so we eliminate the duplicate logic in fork.c
    * reset_files_struct() is only ever called for current; it
    would break the system if somebody called it for anything
    else (we can't modify somebody else's ->files). Lose the
    task_struct * argument.
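    The resulting calling convention, sketched (hedged: this mirrors the
    description above rather than quoting the diff):

    /* Returns the displaced files_struct through *displaced (NULL if
     * nothing was shared), instead of grabbing an extra reference. */
    int unshare_files(struct files_struct **displaced);

    /* Acts on current only, hence no task_struct argument anymore. */
    void reset_files_struct(struct files_struct *new_files);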

    Signed-off-by: Al Viro

    Al Viro
     
  • * unshare_files() can fail; doing it after irreversible actions is wrong
    and de_thread() is certainly irreversible.
    * since we do it unconditionally anyway, we might as well do it in do_execve()
    and save ourselves the PITA in binfmt handlers, etc.
    * while we are at it, binfmt_som actually leaked files_struct on failure.

    As a side benefit, unshare_files(), put_files_struct() and reset_files_struct()
    become unexported.

    Signed-off-by: Al Viro

    Al Viro
     
  • updating current->files requires task_lock

    Signed-off-by: Al Viro

    Al Viro
     

20 Apr, 2008

2 commits

  • Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Suresh Siddha
     
  • Split the FPU save area from the task struct. This allows easy migration
    of FPU context, and it's generally cleaner. It also allows the following
    two optimizations:

    1) only allocate the save area when the application actually uses the FPU,
    i.e., at the first lazy FPU trap. This could save memory for apps that
    don't use the FPU. The next patch does this lazy allocation.

    2) allocate the right size for the actual CPU rather than always 512 bytes.
    Patches enabling xsave/xrstor support (coming shortly) will take advantage
    of this.
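    A hedged sketch of the resulting layout (simplified; the x86 patch keeps
    the state behind a pointer in thread_struct):

    union thread_xstate;        /* fxsave/fsave image formats */

    struct thread_struct_sketch {
        /* Was a fixed 512-byte area embedded in the task struct; now a
         * pointer, left NULL until the first FPU use and sized for the
         * running CPU's actual features. */
        union thread_xstate *xstate;
    };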

    Signed-off-by: Suresh Siddha
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Suresh Siddha
     

29 Mar, 2008

1 commit


15 Feb, 2008

1 commit


09 Feb, 2008

3 commits

  • [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Harvey Harrison
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Some time ago the xxx_vnr() calls (e.g. pid_vnr or find_task_by_vpid) were
    _all_ converted to operate on the current pid namespace. After this each call
    like xxx_nr_ns(foo, current->nsproxy->pid_ns) is nothing but an xxx_vnr(foo)
    one.

    Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where
    appropriate.
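    In kernel-style C, the identity being exploited (a sketch mirroring the
    commit message, not a quoted hunk):

    static inline pid_t pid_vnr(struct pid *pid)
    {
        /* the "current namespace" lookup the callers were open-coding */
        return pid_nr_ns(pid, current->nsproxy->pid_ns);
    }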

    Signed-off-by: Pavel Emelyanov
    Reviewed-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • signal_struct->tsk points to the ->group_leader and thus we have the nasty
    code in de_thread() which has to change it and restart ->real_timer if the
    leader is changed.

    Use "struct pid *leader_pid" instead. This also allows us to kill now
    unneeded send_group_sig_info().

    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Cc: Davide Libenzi
    Cc: Pavel Emelyanov
    Acked-by: Roland McGrath
    Acked-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

08 Feb, 2008

1 commit

  • Basic setup routines: the mm_struct has a pointer to the cgroup that
    it belongs to, and the page has a page_cgroup associated with it.

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     

07 Feb, 2008

2 commits

  • Fix the following section mismatch with CONFIG_HOTPLUG=n,
    CONFIG_HOTPLUG_CPU=y:

    WARNING: vmlinux.o(.text+0x399a6): Section mismatch: reference to .init.text.5:idle_regs (between 'fork_idle' and 'get_task_mm')

    Signed-off-by: Adrian Bunk
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • 1. It is much easier to grep for ->state change if __set_task_state() is used
    instead of the direct assignment.

    2. ptrace_stop() and handle_group_stop() use set_task_state() which adds the
    unneeded mb() (btw even if we use mb() it is still possible that do_wait()
    sees the new ->state but not ->exit_code, but this is ok).

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

06 Feb, 2008

3 commits

  • The capability bounding set is a set beyond which capabilities cannot grow.
    Currently cap_bset is per-system. It can be manipulated through sysctl,
    but only init can add capabilities. Root can remove capabilities. By
    default it includes all caps except CAP_SETPCAP.

    This patch makes the bounding set per-process when file capabilities are
    enabled. It is inherited at fork from the parent. No one can add elements;
    CAP_SETPCAP is required to remove them.

    One example use of this is to start a safer container. For instance, until
    device namespaces or per-container device whitelists are introduced, it is
    best to take CAP_MKNOD away from a container.

    The bounding set will not affect pP and pE immediately. It will only
    affect pP' and pE' after subsequent exec()s. It also does not affect pI,
    and exec() does not constrain pI'. So to really start a shell with no way
    of regaining CAP_MKNOD, you would do

    prctl(PR_CAPBSET_DROP, CAP_MKNOD);
    cap_t cap = cap_get_proc();
    cap_value_t caparray[1];
    caparray[0] = CAP_MKNOD;
    cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
    cap_set_proc(cap);
    cap_free(cap);

    The following test program will get and set the bounding
    set (but not pI). For instance

    ./bset get
    (lists capabilities in bset)
    ./bset drop cap_net_raw
    (starts shell with new bset)
    (use capset, setuid binary, or binary with
    file capabilities to try to increase caps)

    ************************************************************
    cap_bound.c
    ************************************************************
    #include <sys/prctl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #ifndef PR_CAPBSET_READ
    #define PR_CAPBSET_READ 23
    #endif

    #ifndef PR_CAPBSET_DROP
    #define PR_CAPBSET_DROP 24
    #endif

    int usage(char *me)
    {
        printf("Usage: %s get\n", me);
        printf("       %s drop <capability>\n", me);
        return 1;
    }

    #define numcaps 32
    char *captable[numcaps] = {
        "cap_chown",
        "cap_dac_override",
        "cap_dac_read_search",
        "cap_fowner",
        "cap_fsetid",
        "cap_kill",
        "cap_setgid",
        "cap_setuid",
        "cap_setpcap",
        "cap_linux_immutable",
        "cap_net_bind_service",
        "cap_net_broadcast",
        "cap_net_admin",
        "cap_net_raw",
        "cap_ipc_lock",
        "cap_ipc_owner",
        "cap_sys_module",
        "cap_sys_rawio",
        "cap_sys_chroot",
        "cap_sys_ptrace",
        "cap_sys_pacct",
        "cap_sys_admin",
        "cap_sys_boot",
        "cap_sys_nice",
        "cap_sys_resource",
        "cap_sys_time",
        "cap_sys_tty_config",
        "cap_mknod",
        "cap_lease",
        "cap_audit_write",
        "cap_audit_control",
        "cap_setfcap"
    };

    int getbcap(void)
    {
        int comma = 0;
        unsigned long i;
        int ret;

        printf("i know of %d capabilities\n", numcaps);
        printf("capability bounding set:");
        for (i = 0; i < numcaps; i++) {
            ret = prctl(PR_CAPBSET_READ, i);
            if (ret < 0)
                perror("prctl");
            else if (ret == 1)
                printf("%s%s", (comma++) ? ", " : " ", captable[i]);
        }
        printf("\n");
        return 0;
    }

    int capdrop(char *str)
    {
        unsigned long i;
        int found = 0;

        for (i = 0; i < numcaps; i++) {
            if (strcmp(captable[i], str) == 0) {
                found = 1;
                break;
            }
        }
        if (!found)
            return 1;
        if (prctl(PR_CAPBSET_DROP, i)) {
            perror("prctl");
            return 1;
        }
        return 0;
    }

    int main(int argc, char *argv[])
    {
        if (argc < 2)
            return usage(argv[0]);
        if (strcmp(argv[1], "get") == 0)
            return getbcap();
        if (strcmp(argv[1], "drop") == 0) {
            if (argc < 3)
                return usage(argv[0]);
            if (capdrop(argv[2])) {
                printf("unknown capability\n");
                return 1;
            }
            return execl("/bin/bash", "/bin/bash", NULL);
        }
        return usage(argv[0]);
    }
    Signed-off-by: Andrew G. Morgan
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Cc: Casey Schaufler
    Signed-off-by: "Serge E. Hallyn"
    Tested-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • (with Martin Schwidefsky)

    The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
    their first argument. The free functions do not get the mm_struct argument.
    This is (1) asymmetrical, and (2) to do mm-related page table allocations the
    mm argument is needed in the free functions as well.
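    The symmetry being restored, sketched with the pte-level pair (hedged; the
    same mm argument is added across the pgd/pud/pmd/pte levels):

    pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address);
    void pte_free_kernel(struct mm_struct *mm, pte_t *pte);  /* mm now passed here too */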

    [kamalesh@linux.vnet.ibm.com: i386 fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Kamalesh Babulal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Ulrich says that we never used this clone flag and that nothing should be
    using it.

    As we're down to only a single bit left in clone's flags argument, let's add a
    warning to check that no userspace is actually using it. Hopefully we will
    be able to recycle it.

    Roland said:

    CLONE_STOPPED was previously used by some NPTL versions when under
    thread_db (i.e. only when being actively debugged by gdb), but not for a
    long time now, and it never worked reliably when it was used. Removing it
    seems fine to me.

    [akpm@linux-foundation.org: it looks like CLONE_DETACHED is being used]
    Cc: Ulrich Drepper
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

30 Jan, 2008

1 commit


28 Jan, 2008

3 commits


26 Jan, 2008

5 commits

  • LatencyTOP kernel infrastructure; it measures latencies in the
    scheduler and tracks them system-wide and per-process.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar

    Arjan van de Ven
     
  • Extend group scheduling to also cover the realtime classes. It uses the time
    limiting introduced by the previous patch to allow multiple realtime groups.

    The hard time limit is required to keep behaviour deterministic.

    The algorithms used make the realtime scheduler O(tg), i.e., scaling
    linearly with the number of task groups. This is the worst-case behaviour I
    can't seem to get out of; the average case of the algorithms can be
    improved, but I focused on correctness and the worst case.

    [ akpm@linux-foundation.org: move side-effects out of BUG_ON(). ]

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This patch implements a new version of RCU which allows its read-side
    critical sections to be preempted. It uses a set of counter pairs
    to keep track of the read-side critical sections and flips them
    when all tasks exit read-side critical section. The details
    of this implementation can be found in this paper -

    http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf

    and the article-

    http://lwn.net/Articles/253651/

    This patch was developed as a part of the -rt kernel development and
    meant to provide better latencies when read-side critical sections of
    RCU don't disable preemption. As a consequence of keeping track of RCU
    readers, the readers incur a slight overhead (optimizations are discussed
    in the paper). This implementation co-exists with the "classic" RCU
    implementation and can be selected at compile time.

    Also includes RCU tracing summarized in debugfs.

    [ akpm@linux-foundation.org: build fixes on non-preempt architectures ]

    Signed-off-by: Gautham R Shenoy
    Signed-off-by: Dipankar Sarma
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     
  • Some RT tasks (particularly kthreads) are bound to one specific CPU.
    It is fairly common for two or more bound tasks to get queued up at the
    same time. Consider, for instance, softirq_timer and softirq_sched. A
    timer goes off in an ISR which schedules softirq_thread to run at RT50.
    Then the timer handler determines that it's time to smp-rebalance the
    system so it schedules softirq_sched to run. So we are in a situation
    where we have two RT50 tasks queued, and the system will go into
    rt-overload condition to request other CPUs for help.

    This causes two problems in the current code:

    1) If a high-priority bound task and a low-priority unbounded task queue
    up behind the running task, we will fail to ever relocate the unbounded
    task because we terminate the search on the first unmovable task.

    2) We spend precious futile cycles in the fast-path trying to pull
    overloaded tasks over. It is therefore optimal to strive to avoid the
    overhead altogether if we can cheaply detect the condition before
    overload even occurs.

    This patch tries to achieve this optimization by utilizing the Hamming
    weight of the task->cpus_allowed mask. A weight of 1 indicates that
    the task cannot be migrated. We will then utilize this information to
    skip non-migratable tasks and to eliminate unnecessary rebalance attempts.

    We introduce a per-rq variable to count the number of migratable tasks
    that are currently running. We only go into overload if we have more
    than one rt task, AND at least one of them is migratable.

    In addition, we introduce a per-task variable to cache the cpus_allowed
    weight, since the hamming calculation is probably relatively expensive.
    We only update the cached value when the mask is updated which should be
    relatively infrequent, especially compared to scheduling frequency
    in the fast path.
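    The weight test itself is just a population count; a toy userspace
    illustration (names invented, not from the patch):

    #include <stdio.h>

    int main(void)
    {
        unsigned long pinned  = 0x4UL;    /* allowed on CPU 2 only */
        unsigned long roaming = 0xfUL;    /* allowed on CPUs 0-3   */

        /* weight == 1 -> cannot migrate, skip it during rebalance */
        printf("pinned migratable:  %d\n", __builtin_popcountl(pinned)  > 1);
        printf("roaming migratable: %d\n", __builtin_popcountl(roaming) > 1);
        return 0;
    }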

    Signed-off-by: Gregory Haskins
    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Gregory Haskins
     
  • this patch extends the soft-lockup detector to automatically
    detect hung TASK_UNINTERRUPTIBLE tasks. Such hung tasks are
    printed the following way:

    ------------------>
    INFO: task prctl:3042 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
    prctl D fd5e3793 0 3042 2997
    f6050f38 00000046 00000001 fd5e3793 00000009 c06d8264 c06dae80 00000286
    f6050f40 f6050f00 f7d34d90 f7d34fc8 c1e1be80 00000001 f6050000 00000000
    f7e92d00 00000286 f6050f18 c0489d1a f6050f40 00006605 00000000 c0133a5b
    Call Trace:
    [] schedule_timeout+0x6d/0x8b
    [] schedule_timeout_uninterruptible+0x15/0x17
    [] msleep+0x10/0x16
    [] sys_prctl+0x30/0x1e2
    [] sysenter_past_esp+0x5f/0xa5
    =======================
    2 locks held by prctl/3042:
    #0: (&sb->s_type->i_mutex_key#5){--..}, at: [] do_fsync+0x38/0x7a
    #1: (jbd_handle){--..}, at: [] journal_start+0xc7/0xe9
    [ Peter Zijlstra: CPU hotplug fixes. ]
    [ Andrew Morton: build warning fix. ]

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven

    Ingo Molnar
     

06 Dec, 2007

1 commit

  • Currently we are complicating the code in copy_process(), the clone ABI, and
    (if we were to fix its bugs) sys_setsid() itself with an unnecessary
    open-coded version of sys_setsid().

    So just simplify everything and don't special case the session and pgrp of
    the initial process in a pid namespace.

    Not having this special case actually presents to user space the classic
    Linux startup conditions with session == pgrp == 0 for /sbin/init.

    We already handle sending signals to processes in a child pid namespace.

    We need to handle sending signals to processes in a parent pid namespace
    for cases like SIGCHLD and SIGIO.

    This makes nothing extra visible inside a pid namespace. So this extra
    special case appears to have no redeeming merits.

    Further, removing this special case increases the flexibility of how we can
    use pid namespaces, by not requiring the initial process in a pid namespace
    to be a daemon.
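    A hedged demo of the resulting startup conditions (illustrative, not part
    of the patch; needs root and a kernel with pid namespaces):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    static char stack[64 * 1024];

    static int ns_init(void *unused)
    {
        /* classic conditions: pid 1, pgrp 0, session 0 -- the child may
         * call setsid() itself, but no longer has to be a daemon */
        printf("pid=%d pgrp=%d sid=%d\n",
               (int)getpid(), (int)getpgrp(), (int)getsid(0));
        return 0;
    }

    int main(void)
    {
        pid_t p = clone(ns_init, stack + sizeof(stack),
                        CLONE_NEWPID | SIGCHLD, NULL);

        if (p < 0) {
            perror("clone");
            return 1;
        }
        waitpid(p, NULL, 0);
        return 0;
    }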

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

10 Nov, 2007

1 commit

  • Sukadev Bhattiprolu reported a kernel crash with control groups.
    There are a couple of problems discovered by Suka's test:

    - The test requires the cgroup filesystem to be mounted with
    at least the cpu and ns options (i.e. both the namespace and cpu
    controllers are active in the same hierarchy).

    # mkdir /dev/cpuctl
    # mount -t cgroup -ocpu,ns none cpuctl
    (or simply)
    # mount -t cgroup none cpuctl -> Will activate all controllers
    in same hierarchy.

    - The test invokes clone() with CLONE_NEWNS set. This causes a new child
    to be created, along with a new group (do_fork->copy_namespaces->ns_cgroup_clone->
    cgroup_clone), and the child is attached to the new group (cgroup_clone->
    attach_task->sched_move_task). At this point in time, the child's scheduler
    related fields are uninitialized (including its on_rq field, which it has
    inherited from the parent). As a result sched_move_task thinks it's on a
    runqueue when it isn't.

    As a solution to this problem, I moved the sched_fork() call, which
    initializes the scheduler-related fields on a new task, before
    copy_namespaces(). I am not sure though whether moving it up will
    cause other side-effects. Do you see any issue?

    - The second problem exposed by this test is that task_new_fair()
    assumes that the parent and child will be part of the same group (which
    needn't be the case, as this test shows). As a result, cfs_rq->curr can be
    NULL for the child.

    The solution is to test for curr pointer being NULL in
    task_new_fair().

    With the patch below, I could run ns_exec() fine w/o a crash.

    Reported-by: Sukadev Bhattiprolu
    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Ingo Molnar

    Srivatsa Vaddagiri
     

30 Oct, 2007

2 commits