Doug / smarc-fsl-linux-kernel | Embedian Git Server

12 May, 2010

1 commit

34441427a revert "procfs: provide stack information for threads" and its fixup commits ... Browse Code »

Originally, commit d899bf7b ("procfs: provide stack information for
threads") attempted to introduce a new feature for showing where the
threadstack was located and how many pages are being utilized by the
stack.

Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
applied to fix the NO_MMU case.

Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
64-bit") was applied to fix a bug in ia32 executables being loaded.

Commit 9ebd4eba7 ("procfs: fix /proc//stat stack pointer for kernel
threads") was applied to fix a bug which had kernel threads printing a
userland stack address.

Commit 1306d603f ('proc: partially revert "procfs: provide stack
information for threads"') was then applied to revert the stack pages
being used to solve a significant performance regression.

This patch nearly undoes the effect of all these patches.

The reason for reverting these is it provides an unusable value in
field 28. For x86_64, a fork will result in the task->stack_start
value being updated to the current user top of stack and not the stack
start address. This unpredictability of the stack_start value makes
it worthless. That includes the intended use of showing how much stack
space a thread has.

Other architectures will get different values. As an example, ia64
gets 0. The do_fork() and copy_process() functions appear to treat the
stack_start and stack_size parameters as architecture specific.

I only partially reverted c44972f1 ("procfs: disable per-task stack usage
on NOMMU") . If I had completely reverted it, I would have had to change
mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
configured. Since I could not test the builds without significant effort,
I decided to not change mm/Makefile.

I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
information for threads on 64-bit") . I left the KSTK_ESP() change in
place as that seemed worthwhile.

Signed-off-by: Robin Holt
Cc: Stefani Seibold
Cc: KOSAKI Motohiro
Cc: Michal Simek
Cc: Ingo Molnar
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Robin Holt
2010-05-12 08:33:41 +0800

07 Mar, 2010

7 commits

76595f79d coredump: suppress uid comparison test if core output files are pipes ... Browse Code »

Modify uid check in do_coredump so as to not apply it in the case of
pipes.

This just got noticed in testing. The end of do_coredump validates the
uid of the inode for the created file against the uid of the crashing
process to ensure that no one can pre-create a core file with different
ownership and grab the information contained in the core when they
shouldn' tbe able to. This causes failures when using pipes for a core
dumps if the crashing process is not root, which is the uid of the pipe
when it is created.

The fix is simple. Since the check for matching uid's isn't relevant for
pipes (a process can't create a pipe that the uermodehelper code will open
anyway), we can just just skip it in the event ispipe is non-zero

Reverts a pipe-affecting change which was accidentally made in

: commit c46f739dd39db3b07ab5deb4e3ec81e1c04a91af
: Author: Ingo Molnar
: AuthorDate: Wed Nov 28 13:59:18 2007 +0100
: Commit: Linus Torvalds
: CommitDate: Wed Nov 28 10:58:01 2007 -0800
:
: vfs: coredumping fix

Signed-off-by: Neil Horman
Cc: Andi Kleen
Cc: Oleg Nesterov
Cc: Alan Cox
Cc: Al Viro
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Neil Horman
2010-03-07 03:26:46 +0800
5c99cbf49 coredump: set ->group_exit_code for other CLONE_VM tasks too ... Browse Code »

User visible change.

do_coredump() kills all threads which share the same ->mm but only the
coredumping process gets the proper exit_code. Other tasks which share
the same ->mm die "silently" and return status == 0 to parent.

This is historical behaviour, not actually a bug. But I think Frank
Heckenbach rightly dislikes the current behaviour. Simple test-case:

#include
#include
#include
#include

int main(void)
{
int stat;

if (!fork()) {
if (!vfork())
kill(getpid(), SIGQUIT);
}

wait(&stat);
printf("stat=%x\n", stat);
return 0;
}

Before this patch it prints "stat=0" despite the fact the child was killed
by SIGQUIT. After this patch the output is "stat=3" which obviously makes
more sense.

Even with this patch, only the task which originates the coredumping gets
"|= 0x80" if the core was actually dumped, but at least the coredumping
signal is visible to do_wait/etc.

Reported-by: Frank Heckenbach
Signed-off-by: Oleg Nesterov
Acked-by: WANG Cong
Cc: Roland McGrath
Cc: Neil Horman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-03-07 03:26:46 +0800
30736a4d4 coredump: pass mm->flags as a coredump parameter for consistency ... Browse Code »

Pass mm->flags as a coredump parameter for consistency.

---
1787 if (mm->core_state || !get_dumpable(mm)) { mmap_sem);
1789 put_cred(cred);
1790 goto fail;
1791 }
1792
[...]
1798 if (get_dumpable(mm) == 2) { /* Setuid core dump mode */ fsuid = 0; /* Dump root private */
1801 }
---

Since dumpable bits are not protected by lock, there is a chance to change
these bits between (1) and (2).

To solve this issue, this patch copies mm->flags to
coredump_params.mm_flags at the beginning of do_coredump() and uses it
instead of get_dumpable() while dumping core.

This copy is also passed to binfmt->core_dump, since elf*_core_dump() uses
dump_filter bits in mm->flags.

[akpm@linux-foundation.org: fix merge]
Signed-off-by: Masami Hiramatsu
Acked-by: Roland McGrath
Cc: Hidehiro Kawai
Cc: Oleg Nesterov
Cc: Ingo Molnar
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Masami Hiramatsu
2010-03-07 03:26:46 +0800
5ef097dd7 exec: create initial stack independent of PAGE_SIZE ... Browse Code »

Currently we create the initial stack based on the PAGE_SIZE. This is
unnecessary.

This creates this initial stack independent of the PAGE_SIZE.

It also bumps up the number of 4k pages allocated from 20 to 32, to
align with 64K page systems.

Signed-off-by: Michael Neuling
Cc: Helge Deller
Reviewed-by: KOSAKI Motohiro
Cc: Americo Wang
Cc: Anton Blanchard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michael Neuling
2010-03-07 03:26:33 +0800
d554ed895 fs: use rlimit helpers ... Browse Code »

Make sure compiler won't do weird things with limits. E.g. fetching them
twice may return 2 different values after writable limits are implemented.

I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.

Signed-off-by: Jiri Slaby
Cc: Alexander Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Slaby
2010-03-07 03:26:29 +0800
5beb49305 mm: change anon_vma linking to fix multi-process server scalability issue ... Browse Code »

The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.

In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.

This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.

This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.

This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.

The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.

A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.

Some test results:

Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.

With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel
Cc: KOSAKI Motohiro
Cc: Larry Woodman
Cc: Lee Schermerhorn
Cc: Minchan Kim
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rik van Riel
2010-03-07 03:26:26 +0800
34e55232e mm: avoid false sharing of mm_counter ... Browse Code »

Considering the nature of per mm stats, it's the shared object among
threads and can be a cache-miss point in the page fault path.

This patch adds per-thread cache for mm_counter. RSS value will be
counted into a struct in task_struct and synchronized with mm's one at
events.

Now, in this patch, the event is the number of calls to handle_mm_fault.
Per-thread value is added to mm at each 64 calls.

rough estimation with small benchmark on parallel thread (2threads) shows
[before]
4.5 cache-miss/faults
[after]
4.0 cache-miss/faults
Anyway, the most contended object is mmap_sem if the number of threads grows.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Minchan Kim
Cc: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-07 03:26:24 +0800

23 Feb, 2010

1 commit

a17e18790 fs/exec.c: fix initial stack reservation ... Browse Code »

803bf5ec259941936262d10ecc84511b76a20921 ("fs/exec.c: restrict initial
stack space expansion to rlimit") attempts to limit the initial stack to
20*PAGE_SIZE. Unfortunately, in attempting ensure the stack is not
reduced in size, we ended up not changing the stack at all.

This size reduction check is not necessary as the expand_stack call does
this already.

This caused a regression in UML resulting in most guest processes being
killed.

Signed-off-by: Michael Neuling
Reviewed-by: KOSAKI Motohiro
Acked-by: WANG Cong
Cc: Anton Blanchard
Cc: Oleg Nesterov
Cc: James Morris
Cc: Serge Hallyn
Cc: Benjamin Herrenschmidt
Cc: Jouni Malinen
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michael Neuling
2010-02-23 11:50:34 +0800

12 Feb, 2010

1 commit

803bf5ec2 fs/exec.c: restrict initial stack space expansion to rlimit ... Browse Code »

When reserving stack space for a new process, make sure we're not
attempting to expand the stack by more than rlimit allows.

This fixes a bug caused by b6a2fea39318e43fee84fa7b0b90d68bed92d2ba ("mm:
variable length argument support") and unmasked by
fc63cf237078c86214abcb2ee9926d8ad289da9b ("exec: setup_arg_pages() fails
to return errors").

This bug means that when limiting the stack to less the 20*PAGE_SIZE (eg.
80K on 4K pages or 'ulimit -s 79') all processes will be killed before
they start. This is particularly bad with 64K pages, where a ulimit below
1280K will kill every process.

To test, do:

'ulimit -s 15; ls'

before and after the patch is applied. Before it's applied, 'ls' should
be killed. After the patch is applied, 'ls' should no longer be killed.

A stack limit of 15KB since it's small enough to trigger 20*PAGE_SIZE.
Also 15KB not a multiple of PAGE_SIZE, which is a trickier case to handle
correctly with this code.

4K pages should be fine to test with.

[kosaki.motohiro@jp.fujitsu.com: cleanup]
[akpm@linux-foundation.org: cleanup cleanup]
Signed-off-by: Michael Neuling
Signed-off-by: KOSAKI Motohiro
Cc: Americo Wang
Cc: Anton Blanchard
Cc: Oleg Nesterov
Cc: James Morris
Cc: Ingo Molnar
Cc: Serge Hallyn
Cc: Benjamin Herrenschmidt
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michael Neuling
2010-02-12 05:59:43 +0800

03 Feb, 2010

1 commit

7ab02af42 Fix 'flush_old_exec()/setup_new_exec()' split ... Browse Code »

Commit 221af7f87b9 ("Split 'flush_old_exec' into two functions") split
the function at the point of no return - ie right where there were no
more error cases to check. That made sense from a technical standpoint,
but when we then also combined it with the actual personality setting
going in between flush_old_exec() and setup_new_exec(), it needs to be a
bit more careful.

In particular, we need to make sure that we really flush the old
personality bits in the 'flush' stage, rather than later in the 'setup'
stage, since otherwise we might be flushing the _new_ personality state
that we're just setting up.

So this moves the flags and personality flushing (and 'flush_thread()',
which is the arch-specific function that generally resets lazy FP state
etc) of the old process into flush_old_exec(), so that it doesn't affect
any state that execve() is setting up for the new process environment.

This was reported by Michal Simek as breaking his Microblaze qemu
environment.

Reported-and-tested-by: Michal Simek
Cc: Peter Anvin
Signed-off-by: Linus Torvalds

Linus Torvalds
2010-02-03 04:37:44 +0800

30 Jan, 2010

1 commit

221af7f87 Split 'flush_old_exec' into two functions ... Browse Code »

'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.

Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.

As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.

This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.

Signed-off-by: H. Peter Anvin
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds

Linus Torvalds
2010-01-30 00:22:01 +0800

18 Dec, 2009

2 commits

f6151dfea mm: introduce coredump parameter structure ... Browse Code »

Introduce coredump parameter data structure (struct coredump_params) to
simplify binfmt->core_dump() arguments.

Signed-off-by: Masami Hiramatsu
Suggested-by: Ingo Molnar
Cc: Hidehiro Kawai
Cc: Oleg Nesterov
Cc: Roland McGrath
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Masami Hiramatsu
2009-12-18 07:45:31 +0800
9cd80bbb0 do_wait() optimization: do not place sub-threads on task_struct->children list ... Browse Code »

Thanks to Roland who pointed out de_thread() issues.

Currently we add sub-threads to ->real_parent->children list. This buys
nothing but slows down do_wait().

With this patch ->children contains only main threads (group leaders).
The only complication is that forget_original_parent() should iterate over
sub-threads by hand, and de_thread() needs another list_replace() when it
changes ->group_leader.

Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
tasks, we can remove this check (and we can unify do_wait_thread() and
ptrace_do_wait()).

This change can confuse the optimistic search in mm_update_next_owner(),
but this is fixable and minor.

Perhaps badness() and oom_kill_process() should be updated, but they
should be fixed in any case.

Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Cc: Ingo Molnar
Cc: Ratan Nalumasu
Cc: Vitaly Mayatskikh
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-12-18 07:45:31 +0800

16 Dec, 2009

1 commit

4614a696b procfs: allow threads to rename siblings via /proc/pid/tasks/tid/comm ... Browse Code »

Setting a thread's comm to be something unique is a very useful ability
and is helpful for debugging complicated threaded applications. However
currently the only way to set a thread name is for the thread to name
itself via the PR_SET_NAME prctl.

However, there may be situations where it would be advantageous for a
thread dispatcher to be naming the threads its managing, rather then
having the threads self-describe themselves. This sort of behavior is
available on other systems via the pthread_setname_np() interface.

This patch exports a task's comm via proc/pid/comm and
proc/pid/task/tid/comm interfaces, and allows thread siblings to write to
these values.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: John Stultz
Cc: Andi Kleen
Cc: Arjan van de Ven
Cc: Mike Fulton
Cc: Sean Foley
Cc: Darren Hart
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

john stultz
2009-12-16 00:53:24 +0800

03 Dec, 2009

1 commit

c84d6efd3 Merge branch 'master' into next Browse Code »

James Morris
2009-12-03 14:33:40 +0800

12 Nov, 2009

1 commit

fc63cf237 exec: setup_arg_pages() fails to return errors ... Browse Code »

In setup_arg_pages we work hard to assign a value to ret, but on exit we
always return 0.

Also remove a now duplicated exit path and branch to out_unlock instead.

Signed-off-by: Anton Blanchard
Acked-by: Serge Hallyn
Reviewed-by: WANG Cong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anton Blanchard
2009-11-12 23:25:58 +0800

25 Oct, 2009

1 commit

6c21a7fb4 LSM: imbed ima calls in the security hooks ... Browse Code »

Based on discussions on LKML and LSM, where there are consecutive
security_ and ima_ calls in the vfs layer, move the ima_ calls to
the existing security_ hooks.

Signed-off-by: Mimi Zohar
Signed-off-by: James Morris

Mimi Zohar
2009-10-25 12:22:48 +0800

24 Sep, 2009

5 commits

801460d0c task_struct cleanup: move binfmt field to mm_struct ... Browse Code »

Because the binfmt is not different between threads in the same process,
it can be moved from task_struct to mm_struct. And binfmt moudle is
handled per mm_struct instead of task_struct.

Signed-off-by: Hiroshi Shimamoto
Acked-by: Oleg Nesterov
Cc: Rusty Russell
Acked-by: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hiroshi Shimamoto
2009-09-24 22:21:05 +0800
964ee7df9 exec: fix set_binfmt() vs sys_delete_module() race ... Browse Code »

sys_delete_module() can set MODULE_STATE_GOING after
search_binary_handler() does try_module_get(). In this case
set_binfmt()->try_module_get() fails but since none of the callers
check the returned error, the task will run with the wrong old
->binfmt.

The proper fix should change all ->load_binary() methods, but we can
rely on fact that the caller must hold a reference to binfmt->module
and use __module_get() which never fails.

Signed-off-by: Oleg Nesterov
Acked-by: Rusty Russell
Cc: Hiroshi Shimamoto
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-24 22:21:01 +0800
61be228a0 exec: allow do_coredump() to wait for user space pipe readers to complete ... Browse Code »

Allow core_pattern pipes to wait for user space to complete

One of the things that user space processes like to do is look at metadata
for a crashing process in their /proc/ directory. this is racy
however, since do_coredump in the kernel doesn't wait for the user space
process to complete before it reaps the crashing process. This patch
corrects that. Allowing the kernel to wait for the user space process to
complete before cleaning up the crashing process. This is a bit tricky to
do for a few reasons:

1) The user space process isn't our child, so we can't sys_wait4 on it
2) We need to close the pipe before waiting for the user process to complete,
since the user process may rely on an EOF condition

I've discussed several solutions with Oleg Nesterov off-list about this,
and this is the one we've come up with. We add ourselves as a pipe reader
(to prevent premature cleanup of the pipe_inode_info), and remove
ourselves as a writer (to provide an EOF condition to the writer in user
space), then we iterate until the user space process exits (which we
detect by pipe->readers == 1, hence the > 1 check in the loop). When we
exit the loop, we restore the proper reader/writer values, then we return
and let filp_close in do_coredump clean up the pipe data properly.

Signed-off-by: Neil Horman
Reported-by: Earl Chew
Cc: Oleg Nesterov
Cc: Andi Kleen
Cc: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Neil Horman
2009-09-24 22:21:00 +0800
a293980c2 exec: let do_coredump() limit the number of concurrent dumps to pipes ... Browse Code »

Introduce core pipe limiting sysctl.

Since we can dump cores to pipe, rather than directly to the filesystem,
we create a condition in which a user can create a very high load on the
system simply by running bad applications.

If the pipe reader specified in core_pattern is poorly written, we can
have lots of ourstandig resources and processes in the system.

This sysctl introduces an ability to limit that resource consumption.
core_pipe_limit defines how many in-flight dumps may be run in parallel,
dumps beyond this value are skipped and a note is made in the kernel log.
A special value of 0 in core_pipe_limit denotes unlimited core dumps may
be handled (this is the default value).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Neil Horman
Reported-by: Earl Chew
Cc: Oleg Nesterov
Cc: Andi Kleen
Cc: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Neil Horman
2009-09-24 22:21:00 +0800
725eae32d exec: make do_coredump() more resilient to recursive crashes ... Browse Code »

Change how we detect recursive dumps.

Currently we have a mechanism by which we try to compare pathnames of the
crashing process to the core_pattern path. This is broken for a dozen
reasons, and just doesn't work in any sort of robust way.

I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper apps
set RLIMIT_CORE to zero, we don't write out core files for any process
with that particular limit set. It the core_pattern is a pipe, any
non-zero limit is translated to RLIM_INFINITY.

This allows complete dumps to be captured, but prevents infinite recursion
in the event that the core_pattern process itself crashes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Neil Horman
Reported-by: Earl Chew
Cc: Oleg Nesterov
Cc: Andi Kleen
Cc: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Neil Horman
2009-09-24 22:21:00 +0800

23 Sep, 2009

2 commits

d899bf7b5 procfs: provide stack information for threads ... Browse Code »

A patch to give a better overview of the userland application stack usage,
especially for embedded linux.

Currently you are only able to dump the main process/thread stack usage
which is showed in /proc/pid/status by the "VmStk" Value. But you get no
information about the consumed stack memory of the the threads.

There is an enhancement in the /proc//{task/*,}/*maps and which marks
the vm mapping where the thread stack pointer reside with "[thread stack
xxxxxxxx]". xxxxxxxx is the maximum size of stack. This is a value
information, because libpthread doesn't set the start of the stack to the
top of the mapped area, depending of the pthread usage.

A sample output of /proc//task//maps looks like:

08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
a7d12000-a7d13000 ---p 00000000 00:00 0
a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
a7f13000-a7f14000 ---p 00000000 00:00 0
a7f14000-a7f36000 rw-p 00000000 00:00 0
a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
a806c000-a806f000 rw-p 00000000 00:00 0
a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
a8085000-a8088000 rw-p 00000000 00:00 0
a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

Also there is a new entry "stack usage" in /proc//{task/*,}/status
which will you give the current stack usage in kb.

A sample output of /proc/self/status looks like:

Name: cat
State: R (running)
Tgid: 507
Pid: 507
.
.
.
CapBnd: fffffffffffffeff
voluntary_ctxt_switches: 0
nonvoluntary_ctxt_switches: 0
Stack usage: 12 kB

I also fixed stack base address in /proc//{task/*,}/stat to the base
address of the associated thread stack and not the one of the main
process. This makes more sense.

[akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
Signed-off-by: Stefani Seibold
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stefani Seibold
2009-09-23 22:39:41 +0800
1f10206cf getrusage: fill ru_maxrss value ... Browse Code »

Make ->ru_maxrss value in struct rusage filled accordingly to rss hiwater
mark. This struct is filled as a parameter to getrusage syscall.
->ru_maxrss value is set to KBs which is the way it is done in BSD
systems. /usr/bin/time (gnu time) application converts ->ru_maxrss to KBs
which seems to be incorrect behavior. Maintainer of this util was
notified by me with the patch which corrects it and cc'ed.

To make this happen we extend struct signal_struct by two fields. The
first one is ->maxrss which we use to store rss hiwater of the task. The
second one is ->cmaxrss which we use to store highest rss hiwater of all
task childs. These values are used in k_getrusage() to actually fill
->ru_maxrss. k_getrusage() uses current rss hiwater value directly if mm
struct exists.

Note:
exec() clear mm->hiwater_rss, but doesn't clear sig->maxrss.
it is intetionally behavior. *BSD getrusage have exec() inheriting.

test programs
========================================================

getrusage.c
===========
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

#include "common.h"

#define err(str) perror(str), exit(1)

int main(int argc, char** argv)
{
int status;

printf("allocate 100MB\n");
consume(100);

printf("testcase1: fork inherit? \n");
printf(" expect: initial.self ~= child.self\n");
show_rusage("initial");
if (__fork()) {
wait(&status);
} else {
show_rusage("fork child");
_exit(0);
}
printf("\n");

printf("testcase2: fork inherit? (cont.) \n");
printf(" expect: initial.children ~= 100MB, but child.children = 0\n");
show_rusage("initial");
if (__fork()) {
wait(&status);
} else {
show_rusage("child");
_exit(0);
}
printf("\n");

printf("testcase3: fork + malloc \n");
printf(" expect: child.self ~= initial.self + 50MB\n");
show_rusage("initial");
if (__fork()) {
wait(&status);
} else {
printf("allocate +50MB\n");
consume(50);
show_rusage("fork child");
_exit(0);
}
printf("\n");

printf("testcase4: grandchild maxrss\n");
printf(" expect: post_wait.children ~= 300MB\n");
show_rusage("initial");
if (__fork()) {
wait(&status);
show_rusage("post_wait");
} else {
system("./child -n 0 -g 300");
_exit(0);
}
printf("\n");

printf("testcase5: zombie\n");
printf(" expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
printf(" post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
show_rusage("initial");
if (__fork()) {
sleep(1); /* children become zombie */
show_rusage("pre_wait");
wait(&status);
show_rusage("post_wait");
} else {
system("./child -n 400");
_exit(0);
}
printf("\n");

printf("testcase6: SIG_IGN\n");
printf(" expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
show_rusage("initial");
signal(SIGCHLD, SIG_IGN);
if (__fork()) {
sleep(1); /* children become zombie */
show_rusage("after_zombie");
} else {
system("./child -n 500");
_exit(0);
}
printf("\n");
signal(SIGCHLD, SIG_DFL);

printf("testcase7: exec (without fork) \n");
printf(" expect: initial ~= exec \n");
show_rusage("initial");
execl("./child", "child", "-v", NULL);

return 0;
}

child.c
=======
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

#include "common.h"

int main(int argc, char** argv)
{
int status;
int c;
long consume_size = 0;
long grandchild_consume_size = 0;
int show = 0;

while ((c = getopt(argc, argv, "n:g:v")) != -1) {
switch (c) {
case 'n':
consume_size = atol(optarg);
break;
case 'v':
show = 1;
break;
case 'g':

grandchild_consume_size = atol(optarg);
break;
default:
break;
}
}

if (show)
show_rusage("exec");

if (consume_size) {
printf("child alloc %ldMB\n", consume_size);
consume(consume_size);
}

if (grandchild_consume_size) {
if (fork()) {
wait(&status);
} else {
printf("grandchild alloc %ldMB\n", grandchild_consume_size);
consume(grandchild_consume_size);

exit(0);
}
}

return 0;
}

common.c
========
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

#include "common.h"
#define err(str) perror(str), exit(1)

void show_rusage(char *prefix)
{
int err, err2;
struct rusage rusage_self;
struct rusage rusage_children;

printf("%s: ", prefix);
err = getrusage(RUSAGE_SELF, &rusage_self);
if (!err)
printf("self %ld ", rusage_self.ru_maxrss);
err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
if (!err2)
printf("children %ld ", rusage_children.ru_maxrss);

printf("\n");
}

/* Some buggy OS need this worthless CPU waste. */
void make_pagefault(void)
{
void *addr;
int size = getpagesize();
int i;

for (i=0; i
Signed-off-by: KOSAKI Motohiro
Cc: Oleg Nesterov
Cc: Hugh Dickins
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Pirko
2009-09-23 22:39:30 +0800

21 Sep, 2009

1 commit

cdd6c482c perf: Do the big rename: Performance Counters -> Performance Events ... Browse Code »

Bye-bye Performance Counters, welcome Performance Events!

In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.

Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.

All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)

The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.

Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.

User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)

This patch has been generated via the following script:

FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES

for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done

FILES=$(find . -name perf_event.*)

sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES

... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.

Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.

( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )

Suggested-by: Stephane Eranian
Acked-by: Peter Zijlstra
Acked-by: Paul Mackerras
Reviewed-by: Arjan van de Ven
Cc: Mike Galbraith
Cc: Arnaldo Carvalho de Melo
Cc: Frederic Weisbecker
Cc: Steven Rostedt
Cc: Benjamin Herrenschmidt
Cc: David Howells
Cc: Kyle McMartin
Cc: Martin Schwidefsky
Cc: "David S. Miller"
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc:
LKML-Reference:
Signed-off-by: Ingo Molnar

Ingo Molnar
2009-09-21 20:28:04 +0800

06 Sep, 2009

1 commit

a2a8474c3 exec: do not sleep in TASK_TRACED under ->cred_guard_mutex ... Browse Code »

Tom Horsley reports that his debugger hangs when it tries to read
/proc/pid_of_tracee/maps, this happens since

"mm_for_maps: take ->cred_guard_mutex to fix the race with exec"
04b836cbf19e885f8366bccb2e4b0474346c02d

commit in 2.6.31.

But the root of the problem lies in the fact that do_execve() path calls
tracehook_report_exec() which can stop if the tracer sets PT_TRACE_EXEC.

The tracee must not sleep in TASK_TRACED holding this mutex. Even if we
remove ->cred_guard_mutex from mm_for_maps() and proc_pid_attr_write(),
another task doing PTRACE_ATTACH should not hang until it is killed or the
tracee resumes.

With this patch do_execve() does not use ->cred_guard_mutex directly and
we do not hold it throughout, instead:

- introduce prepare_bprm_creds() helper, it locks the mutex
and calls prepare_exec_creds() to initialize bprm->cred.

- install_exec_creds() drops the mutex after commit_creds(),
and thus before tracehook_report_exec()->ptrace_stop().

or, if exec fails,

free_bprm() drops this mutex when bprm->cred != NULL which
indicates install_exec_creds() was not called.

Reported-by: Tom Horsley
Signed-off-by: Oleg Nesterov
Acked-by: David Howells
Cc: Roland McGrath
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-06 02:30:42 +0800

24 Aug, 2009

1 commit

6777d773a kernel_read: redefine offset type ... Browse Code »

vfs_read() offset is defined as loff_t, but kernel_read()
offset is only defined as unsigned long. Redefine
kernel_read() offset as loff_t.

Cc: stable@kernel.org
Signed-off-by: Mimi Zohar
Signed-off-by: James Morris

Mimi Zohar
2009-08-24 12:58:23 +0800

07 Jul, 2009

1 commit

793285fca cred_guard_mutex: do not return -EINTR to user-space ... Browse Code »

do_execve() and ptrace_attach() return -EINTR if
mutex_lock_interruptible(->cred_guard_mutex) fails.

This is not right, change the code to return ERESTARTNOINTR.

Perhaps we should also change proc_pid_attr_write().

Signed-off-by: Oleg Nesterov
Cc: David Howells
Acked-by: Roland McGrath
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-07-07 04:57:04 +0800

12 Jun, 2009

1 commit

8a1ca8ced Merge branch 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kern… ... Browse Code »

…el/git/tip/linux-2.6-tip

* 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
perf_counter: Turn off by default
perf_counter: Add counter->id to the throttle event
perf_counter: Better align code
perf_counter: Rename L2 to LL cache
perf_counter: Standardize event names
perf_counter: Rename enums
perf_counter tools: Clean up u64 usage
perf_counter: Rename perf_counter_limit sysctl
perf_counter: More paranoia settings
perf_counter: powerpc: Implement generalized cache events for POWER processors
perf_counters: powerpc: Add support for POWER7 processors
perf_counter: Accurate period data
perf_counter: Introduce struct for sample data
perf_counter tools: Normalize data using per sample period data
perf_counter: Annotate exit ctx recursion
perf_counter tools: Propagate signals properly
perf_counter tools: Small frequency related fixes
perf_counter: More aggressive frequency adjustment
perf_counter/x86: Fix the model number of Intel Core2 processors
perf_counter, x86: Correct some event and umask values for Intel processors
...

Linus Torvalds
2009-06-12 05:01:07 +0800

22 May, 2009

2 commits

2c9e703c6 Merge branch 'master' into next ... Browse Code »

Conflicts:
fs/exec.c

Removed IMA changes (the IMA checks are now performed via may_open()).

Signed-off-by: James Morris

James Morris
2009-05-22 16:40:59 +0800
b9fc745db integrity: path_check update ... Browse Code »

- Add support in ima_path_check() for integrity checking without
incrementing the counts. (Required for nfsd.)
- rename and export opencount_get to ima_counts_get
- replace ima_shm_check calls with ima_counts_get
- export ima_path_check

Signed-off-by: Mimi Zohar
Signed-off-by: James Morris

Mimi Zohar
2009-05-22 07:43:41 +0800

18 May, 2009

1 commit

dc3f81b12 Merge commit 'v2.6.30-rc6' into perfcounters/core ... Browse Code »

Merge reason: this branch was on an -rc4 base, merge it up to -rc6
to get the latest upstream fixes.

Signed-off-by: Ingo Molnar

Ingo Molnar
2009-05-18 13:37:49 +0800

11 May, 2009

1 commit

5e751e992 CRED: Rename cred_exec_mutex to reflect that it's a guard against ptrace ... Browse Code »

Rename cred_exec_mutex to reflect that it's a guard against foreign
intervention on a process's credential state, such as is made by ptrace(). The
attachment of a debugger to a process affects execve()'s calculation of the new
credential state - _and_ also setprocattr()'s calculation of that state.

Signed-off-by: David Howells
Signed-off-by: James Morris

David Howells
2009-05-11 06:15:36 +0800

09 May, 2009

2 commits

6e8341a11 Switch open_exec() and sys_uselib() to do_open_filp() ... Browse Code »

... and make path_lookup_open() static

Signed-off-by: Al Viro

Al Viro
2009-05-09 22:49:42 +0800
a44ddbb6d Make open_exec() and sys_uselib() use may_open(), instead of duplicating its parts ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2009-05-09 22:49:42 +0800

03 May, 2009

1 commit

74641f584 alpha: binfmt_aout fix ... Browse Code »

This fixes the problem introduced by commit 3bfacef412 (get rid of
special-casing the /sbin/loader on alpha): osf/1 ecoff binary segfaults
when binfmt_aout built as module. That happens because aout binary
handler gets on the top of the binfmt list due to late registration, and
kernel attempts to execute the binary without preparatory work that must
be done by binfmt_loader.

Fixed by changing the registration order of the default binfmt handlers
using list_add_tail() and introducing insert_binfmt() function which
places new handler on the top of the binfmt list. This might be generally
useful for installing arch-specific frontends for default handlers or just
for overriding them.

Signed-off-by: Ivan Kokshaysky
Cc: Al Viro
Cc: Richard Henderson
Signed-off-by: Linus Torvalds

Ivan Kokshaysky
2009-05-03 06:36:10 +0800

29 Apr, 2009

1 commit

e7fd5d4b3 Merge branch 'linus' into perfcounters/core ... Browse Code »

Merge reason: This brach was on -rc1, refresh it to almost-rc4 to pick up
the latest upstream fixes.

Signed-off-by: Ingo Molnar

Ingo Molnar
2009-04-29 20:47:05 +0800

24 Apr, 2009

2 commits

437f7fdb6 check_unsafe_exec: s/lock_task_sighand/rcu_read_lock/ ... Browse Code »

write_lock(¤t->fs->lock) guarantees we can't wrongly miss
LSM_UNSAFE_SHARE, this is what we care about. Use rcu_read_lock()
instead of ->siglock to iterate over the sub-threads. We must see
all CLONE_THREAD|CLONE_FS threads which didn't pass exit_fs(), it
takes fs->lock too.

With or without this patch we can miss the freshly cloned thread
and set LSM_UNSAFE_SHARE, we don't care.

Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
[ Fixed lock/unlock typo - Hugh ]
Acked-by: Hugh Dickins
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-04-24 22:39:45 +0800
8c652f96d do_execve() must not clear fs->in_exec if it was set by another thread ... Browse Code »

If do_execve() fails after check_unsafe_exec(), it clears fs->in_exec
unconditionally. This is wrong if we race with our sub-thread which
also does do_execve:

Two threads T1 and T2 and another process P, all share the same
->fs.

T1 starts do_execve(BAD_FILE). It calls check_unsafe_exec(), since
->fs is shared, we set LSM_UNSAFE but not ->in_exec.

P exits and decrements fs->users.

T2 starts do_execve(), calls check_unsafe_exec(), now ->fs is not
shared, we set fs->in_exec.

T1 continues, open_exec(BAD_FILE) fails, we clear ->in_exec and
return to the user-space.

T1 does clone(CLONE_FS /* without CLONE_THREAD */).

T2 continues without LSM_UNSAFE_SHARE while ->fs is shared with
another process.

Change check_unsafe_exec() to return res = 1 if we set ->in_exec, and change
do_execve() to clear ->in_exec depending on res.

When do_execve() suceeds, it is safe to clear ->in_exec unconditionally.
It can be set only if we don't share ->fs with another process, and since
we already killed all sub-threads either ->in_exec == 0 or we are the
only user of this ->fs.

Also, we do not need fs->lock to clear fs->in_exec.

Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
Acked-by: Hugh Dickins
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-04-24 22:39:45 +0800