Doug / smarc-fsl-linux-kernel | Embedian Git Server

08 Dec, 2009

1 commit

1557d3300 Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits)
security/tomoyo: Remove now unnecessary handling of security_sysctl.
security/tomoyo: Add a special case to handle accesses through the internal proc mount.
sysctl: Drop & in front of every proc_handler.
sysctl: Remove CTL_NONE and CTL_UNNUMBERED
sysctl: kill dead ctl_handler definitions.
sysctl: Remove the last of the generic binary sysctl support
sysctl net: Remove unused binary sysctl code
sysctl security/tomoyo: Don't look at ctl_name
sysctl arm: Remove binary sysctl support
sysctl x86: Remove dead binary sysctl support
sysctl sh: Remove dead binary sysctl support
sysctl powerpc: Remove dead binary sysctl support
sysctl ia64: Remove dead binary sysctl support
sysctl s390: Remove dead sysctl binary support
sysctl frv: Remove dead binary sysctl support
sysctl mips/lasat: Remove dead binary sysctl support
sysctl drivers: Remove dead binary sysctl support
sysctl crypto: Remove dead binary sysctl support
sysctl security/keys: Remove dead binary sysctl support
sysctl kernel: Remove binary sysctl logic
...

Linus Torvalds
2009-12-08 23:38:50 +0800

06 Dec, 2009

1 commit

897e81bea Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits)
sched, cputime: Introduce thread_group_times()
sched, cputime: Cleanups related to task_times()
Revert "sched, x86: Optimize branch hint in __switch_to()"
sched: Fix isolcpus boot option
sched: Revert 498657a478c60be092208422fefa9c7b248729c2
sched, time: Define nsecs_to_jiffies()
sched: Remove task_{u,s,g}time()
sched: Introduce task_times() to replace task_{u,s}time() pair
sched: Limit the number of scheduler debug messages
sched.c: Call debug_show_all_locks() when dumping all tasks
sched, x86: Optimize branch hint in __switch_to()
sched: Optimize branch hint in context_switch()
sched: Optimize branch hint in pick_next_task_fair()
sched_feat_write(): Update ppos instead of file->f_pos
sched: Sched_rt_periodic_timer vs cpu hotplug
sched, kvm: Fix race condition involving sched_in_preempt_notifers
sched: More generic WAKE_AFFINE vs select_idle_sibling()
sched: Cleanup select_task_rq_fair()
sched: Fix granularity of task_u/stime()
sched: Fix/add missing update_rq_clock() calls
...

Linus Torvalds
2009-12-06 07:30:49 +0800

03 Dec, 2009

1 commit

0cf55e1ec sched, cputime: Introduce thread_group_times()

This is a real fix for the problem of decreasing utime/stime values
described in the thread:

http://lkml.org/lkml/2009/11/3/522

Now cputime is accounted in the following way:

- {u,s}time in task_struct is increased every time the thread
is interrupted by a tick (timer interrupt).

- When a thread exits, its {u,s}time is added to signal->{u,s}time
after being adjusted by task_times().

- When all threads in a thread_group exit, the accumulated {u,s}time
(and also c{u,s}time) in the signal struct is added to c{u,s}time
in the signal struct of the group's parent.

So {u,s}time in the task struct is a "raw" tick count, while
{u,s}time and c{u,s}time in the signal struct are "adjusted" values.

The accounted values are used by:

- task_times(), to get the cputime of a thread:
This function returns adjusted values that originate from the raw
{u,s}time, scaled by the sum_exec_runtime accounted by CFS.

- thread_group_cputime(), to get the cputime of a thread group:
This function returns the sum of {u,s}time of all living threads in
the group, plus the {u,s}time in the signal struct, which is the sum of
the adjusted cputimes of all exited threads that belonged to the group.

The problem is the return value of thread_group_cputime(),
because it is a mixed sum of "raw" and "adjusted" values:

group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

This misbehavior can break {u,s}time monotonicity.
Assume there is a thread whose raw values are greater than its
adjusted values (e.g. it was interrupted by 1000Hz ticks 50 times
but only ran for 45ms). When it exits, the group's cputime will
decrease (e.g. by 5ms).

To fix this, we could do:

group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

But task_times() contains hard divisions, so applying it for
every thread should be avoided.

This patch fixes the above problem in the following way:

- Modify the thread's exit (= __exit_signal()) not to use task_times().
This means {u,s}time in the signal struct accumulates raw values instead
of adjusted values. As a result, thread_group_cputime() returns
a pure sum of "raw" values.

- Introduce a new function thread_group_times(*task, *utime, *stime)
that converts the "raw" values of thread_group_cputime() to "adjusted"
values, using the same calculation procedure as task_times().

- Modify the group's exit (= wait_task_zombie()) to use the newly
introduced thread_group_times(). This makes c{u,s}time in the signal
struct hold adjusted values, as before this patch.

- Replace some thread_group_cputime() calls with thread_group_times().
These replacements are applied only where the "adjusted" cputime is
conveyed to users and where task_times() is already used nearby
(i.e. sys_times(), getrusage(), and /proc/<pid>/stat).

This patch has a positive side effect:

- Before this patch, if a group contained many short-lived threads
(e.g. each runs 0.9ms and is not interrupted by ticks), the group's
cputime could be invisible, since each thread's cputime was adjusted
before being accumulated: imagine the adjustment function as
adj(ticks, runtime), then
{adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
After this patch this no longer happens, because the adjustment is
applied after accumulation.

v2:
- remove if()s, put new variables into signal_struct.

Signed-off-by: Hidetoshi Seto
Acked-by: Peter Zijlstra
Cc: Spencer Candland
Cc: Americo Wang
Cc: Oleg Nesterov
Cc: Balbir Singh
Cc: Stanislaw Gruszka
LKML-Reference:
Signed-off-by: Ingo Molnar

Hidetoshi Seto
2009-12-03 00:32:40 +0800

26 Nov, 2009

3 commits

d5b7c78e9 sched: Remove task_{u,s,g}time()

Now all task_{u,s}time() pairs are replaced by task_times().
And task_gtime() is too simple to be an inline function.

Clean them all up.

Signed-off-by: Hidetoshi Seto
Acked-by: Peter Zijlstra
Cc: Stanislaw Gruszka
Cc: Spencer Candland
Cc: Oleg Nesterov
Cc: Balbir Singh
Cc: Americo Wang
LKML-Reference:
Signed-off-by: Ingo Molnar

Hidetoshi Seto
2009-11-26 19:59:20 +0800

d180c5bcc sched: Introduce task_times() to replace task_{u,s}time() pair

The functions task_{u,s}time() are called as a pair in almost all
cases. However, task_stime() is implemented to call task_utime()
internally, so such paired calls run task_utime() twice.

This means we do heavy divisions (div_u64 + do_div) twice to get
utime and stime, which can be obtained at the same time with one set
of divisions.

This patch introduces a function task_times(*tsk, *utime,
*stime) to retrieve utime and stime at once in a better, optimized
way.

Signed-off-by: Hidetoshi Seto
Acked-by: Peter Zijlstra
Cc: Stanislaw Gruszka
Cc: Spencer Candland
Cc: Oleg Nesterov
Cc: Balbir Singh
Cc: Americo Wang
LKML-Reference:
Signed-off-by: Ingo Molnar

Hidetoshi Seto
2009-11-26 19:59:19 +0800

16bc67ede Merge branch 'sched/urgent' into sched/core

Merge reason: Pick up fixes that did not make it into .32.0

Signed-off-by: Ingo Molnar

Ingo Molnar
2009-11-26 17:50:42 +0800

18 Nov, 2009

1 commit

9ebd4eba7 procfs: fix /proc/<pid>/stat stack pointer for kernel threads

Fix a small issue for the stack pointer in /proc/<pid>/stat. In case of a
kernel thread the value of the printed stack pointer should be 0.

Signed-off-by: Stefani Seibold
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stefani Seibold
2009-11-18 09:40:33 +0800

17 Nov, 2009

1 commit

bb9074ff5 Merge commit 'v2.6.32-rc7'

Resolve the conflict between v2.6.32-rc7 where dn_def_dev_handler
gets a small bug fix and the sysctl tree where I am removing all
sysctl strategy routines.

Eric W. Biederman
2009-11-17 17:01:34 +0800

12 Nov, 2009

1 commit

29f12ca32 pidns: fix a leak in /proc dentries and inodes with pid namespaces.

Daniel Lezcano reported a leak in 'struct pid' and 'struct pid_namespace'
that is discussed in:

http://lkml.org/lkml/2009/10/2/159.

To summarize the thread, when container-init is terminated, it sets the
PF_EXITING flag, zaps other processes in the container and waits to reap
them. As a part of reaping, the container-init should flush any /proc
dentries associated with the processes. But because the container-init is
itself exiting, the PF_EXITING check prevents the dentries from being
flushed, resulting in a leak of /proc inodes and dentries.

This fix reverts the commit 7766755a2f249e7e0 ("Fix /proc dcache deadlock
in do_exit") which introduced the check for PF_EXITING. At the time of
the commit, shrink_dcache_parent() flushed dentries from other filesystems
also and could have caused a deadlock which the commit fixed. But as
pointed out by Eric Biederman, after commit 0feae5c47aabdde59,
shrink_dcache_parent() no longer affects other filesystems. So reverting
the commit is now safe.

As pointed out by Jan Kara, the leak is not as critical since the
unclaimed space will be reclaimed under memory pressure or by:

echo 3 > /proc/sys/vm/drop_caches

But since this check is no longer required, it's best to remove it.

Signed-off-by: Sukadev Bhattiprolu
Reported-by: Daniel Lezcano
Acked-by: Eric W. Biederman
Acked-by: Jan Kara
Cc: Andrea Arcangeli
Cc: Serge Hallyn
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sukadev Bhattiprolu
2009-11-12 23:25:57 +0800

11 Nov, 2009

1 commit

2315ffa0a sysctl: Don't look at ctl_name and strategy in the generic code

The ctl_name and strategy fields are unused, now that sys_sysctl
is a compatibility wrapper around /proc/sys. No longer looking
at them in the generic code is effectively what we are doing
now and provides the guarantee that during further cleanups
we can just remove references to those fields and everything
will work ok.

Signed-off-by: Eric W. Biederman

Eric W. Biederman
2009-11-11 16:53:43 +0800

29 Oct, 2009

1 commit

370c28def hwpoison: fix/proc/meminfo alignment

Given such a long name, the kB count in /proc/meminfo's HardwareCorrupted
line is being shown too far right (it does align with x86_64's VmallocChunk
above, but I hope nobody will ever have that much corrupted!). Align it.

Signed-off-by: Hugh Dickins
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-10-29 22:39:25 +0800

26 Oct, 2009

2 commits

ce0e7b28f sched, cpuacct: Fix niced guest time accounting

The CPU time of a guest is always accounted as 'user' time,
regardless of the nice value of its counterpart process,
even though the guest is scheduled under that nice value.

This patch fixes the defect and accounts the CPU time of
a niced guest as 'nice' time, the same as for a niced process.

The patch also adds 'guest_nice' to cpuacct. The value
provides the niced guest CPU time, which is to 'guest'
what 'nice' is to 'user'.

The original discussions can be found here:

http://www.mail-archive.com/kvm@vger.kernel.org/msg23982.html
http://www.mail-archive.com/kvm@vger.kernel.org/msg23860.html

Signed-off-by: Ryota Ozaki
Acked-by: Avi Kivity
Cc: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Ryota Ozaki
2009-10-26 00:31:30 +0800

0b9e31e92 Merge branch 'linus' into sched/core

Conflicts:
fs/proc/array.c

Merge reason: resolve conflict and queue up dependent patch.

Signed-off-by: Ingo Molnar

Ingo Molnar
2009-10-26 00:30:53 +0800

08 Oct, 2009

2 commits

253fb02d6 pagemap: export KPF_HWPOISON

This flag indicates a hardware detected memory corruption on the page.
Any future access of the page data may bring down the machine.

Signed-off-by: Wu Fengguang
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2009-10-08 22:36:39 +0800

4055e9731 fs: includecheck fix: proc, kcore.c

fix the following 'make includecheck' warning:

fs/proc/kcore.c: linux/mm.h is included more than once.

Signed-off-by: Jaswinder Singh Rajput
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jaswinder Singh Rajput
2009-10-08 22:36:38 +0800

25 Sep, 2009

2 commits

c44972f17 procfs: disable per-task stack usage on NOMMU

It needs walk_page_range().

Reported-by: Michal Simek
Tested-by: Michal Simek
Cc: Stefani Seibold
Cc: David Howells
Cc: Paul Mundt
Cc: Geert Uytterhoeven
Cc: Greg Ungerer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2009-09-25 08:11:24 +0800

7ca263cdf Merge branch 'cputime' of git://git390.marist.edu/pub/scm/linux-2.6

* 'cputime' of git://git390.marist.edu/pub/scm/linux-2.6:
[PATCH] Fix idle time field in /proc/uptime

Linus Torvalds
2009-09-25 00:04:24 +0800

24 Sep, 2009

3 commits

db1682636 Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6

* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...

Linus Torvalds
2009-09-24 22:53:22 +0800

8d65af789 sysctl: remove "struct file *" argument of ->proc_handler

It's unused.

It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.

It _was_ used in two places at arch/frv for some reason.

Signed-off-by: Alexey Dobriyan
Cc: David Howells
Cc: "Eric W. Biederman"
Cc: Al Viro
Cc: Ralf Baechle
Cc: Martin Schwidefsky
Cc: Ingo Molnar
Cc: "David S. Miller"
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2009-09-24 22:21:04 +0800

96830a57d [PATCH] Fix idle time field in /proc/uptime

Git commit 79741dd changes idle cputime accounting, but unfortunately
the /proc/uptime file hasn't caught up. Here the idle time calculation
from /proc/stat is copied over.

Signed-off-by: Michael Abbott
Signed-off-by: Martin Schwidefsky

Michael Abbott
2009-09-24 16:16:24 +0800

23 Sep, 2009

17 commits

0d4c36a9b /proc/kcore: update stat.st_size after memory hotplug

After memory hotplug (or other events in the future), the kcore size can
change.

To update inode->i_size, we have to know the inode/dentry, but we can't get
it from inside /proc directly. But considering memory hotplug, the kcore
image is updated only when it's opened. So updating inode->i_size at open()
is enough.

Signed-off-by: KAMEZAWA Hiroyuki
Acked-by: WANG Cong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-09-23 22:39:42 +0800

678ad5d8a /proc/kcore: fix stat.st_size

Presently the size of /proc/kcore which can be read by 'ls -l' is 0. But
it's not the correct value.

On x86-64, ls -l shows
... root root 140737486266368 2009-09-17 10:29 /proc/kcore
Then, 7FFFFFFE02000. This comes from vmalloc area's size.
(*) This shows "core" size, not memory size.

This patch shows the size by updating the "size" field in struct
proc_dir_entry. Later, the lookup routine will create the inode and fill
inode->i_size based on this value. This has a problem:

- Once the inode is cached, inode->i_size will never be updated.

So this patch is not memory-hotplug-aware.

To update inode->i_size, we have to know the dentry or inode.
But there is no way to look them up from inside the kernel. Hmmm....
The next patch will try it.

Signed-off-by: KAMEZAWA Hiroyuki
Acked-by: WANG Cong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-09-23 22:39:42 +0800

90396f96b kcore: more fixes for init

proc_kcore_init() doesn't check the NULL case. Fix it and remove unnecessary
comments.

Signed-off-by: KAMEZAWA Hiroyuki
Acked-by: WANG Cong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-09-23 22:39:42 +0800

81ac3ad90 kcore: register module area in generic way

Some archs define MODULES_VADDR/MODULES_END, which is not in the VMALLOC
area. This is handled only on x86-64. This patch makes it more generic. And
we can use vread/vwrite to access the area. Fix it.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Jiri Slaby
Cc: Ralf Baechle
Cc: Benjamin Herrenschmidt
Cc: WANG Cong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-09-23 22:39:42 +0800

26562c59f kcore: register vmemmap range

Benjamin Herrenschmidt pointed out that the vmemmap
range is not included in KCORE_RAM, KCORE_VMALLOC ....

This adds KCORE_VMEMMAP if SPARSEMEM_VMEMMAP is used. With this, vmemmap
becomes readable via /proc/kcore.

Because it's not a vmalloc area, vread/vwrite cannot be used. But since the
range is static with respect to the memory layout, this patch handles the
vmemmap area with the same scheme as physical memory.

This patch assumes the SPARSEMEM_VMEMMAP range is not in the VMALLOC range.
That is correct now.

[akpm@linux-foundation.org: fix typo]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Jiri Slaby
Cc: Ralf Baechle
Cc: Benjamin Herrenschmidt
Cc: WANG Cong
Cc: Benjamin Herrenschmidt
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-09-23 22:39:41 +0800

3089aa1b0 kcore: use registerd physmem information

For /proc/kcore, each arch registers its memory ranges with kclist_add().
Usually,

- range of physical memory
- range of vmalloc area
- text, etc...

are registered, but the "range of physical memory" has some problems. It
isn't updated at memory hotplug and it tends to include unnecessary
memory holes. Now, /proc/iomem (kernel/resource.c) includes the required
physical memory range information and it's properly updated at memory
hotplug. So it's better to avoid keeping separate code (duplicating
information) and to rebuild kclist for physical memory based on
/proc/iomem.

Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: Jiri Slaby
Cc: Ralf Baechle
Cc: Benjamin Herrenschmidt
Cc: WANG Cong
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-09-23 22:39:41 +0800

9492587cf kcore: register text area in generic way

Some 64bit archs have a special segment for mapping kernel text. It should be
entered into /proc/kcore in addition to the direct-linear-map and vmalloc
areas. This patch unifies the KCORE_TEXT entries scattered under x86 and ia64.

I'm not familiar with other archs (mips has its own even after this patch),
but if the range [_stext ..._end) is a valid text area and it's not in the
direct-map area, defining CONFIG_ARCH_PROC_KCORE_TEXT is the only thing
that needs to be done.

Note: I left mips as it is now.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Ralf Baechle
Cc: Benjamin Herrenschmidt
Cc: WANG Cong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-09-23 22:39:41 +0800

a0614da88 kcore: register vmalloc area in generic way

For /proc/kcore, vmalloc areas are registered per arch. But all of them
register the same range of [VMALLOC_START...VMALLOC_END). This patch unifies
them. With this, archs which have no kclist_add() hooks can see the vmalloc
area correctly.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Ralf Baechle
Cc: Benjamin Herrenschmidt
Cc: WANG Cong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-09-23 22:39:41 +0800

c30bb2a25 kcore: add kclist types

Presently, kclist_add() only takes a start address and size as its arguments.
To make kclist dynamically reconfigurable, it's necessary to
know which kclists are for System RAM and which are not.

This patch adds kclist types as
KCORE_RAM
KCORE_VMALLOC
KCORE_TEXT
KCORE_OTHER

This "type" is used in a following patch for detecting KCORE_RAM.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Ralf Baechle
Cc: Benjamin Herrenschmidt
Cc: WANG Cong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-09-23 22:39:41 +0800

2ef43ec77 kcore: use usual list for kclist

This patchset is for /proc/kcore. With this,

- many per-arch hooks are removed.

- /proc/kcore will know really valid physical memory area.

- /proc/kcore will be aware of memory hotplug.

- /proc/kcore will be architecture independent i.e.
if an arch supports CONFIG_MMU, it can use /proc/kcore.
(if the arch uses usual memory layout.)

This patch:

/proc/kcore uses its own list handling code. It's better to use
the generic list code.

No changes in logic. Just a cleanup.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Ralf Baechle
Cc: Benjamin Herrenschmidt
Cc: WANG Cong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-09-23 22:39:41 +0800

d899bf7b5 procfs: provide stack information for threads

A patch to give a better overview of userland application stack usage,
especially for embedded Linux.

Currently you are only able to dump the main process/thread stack usage,
which is shown in /proc/pid/status by the "VmStk" value. But you get no
information about the stack memory consumed by the threads.

There is an enhancement in /proc/<pid>/{task/*,}/*maps which marks
the vm mapping where the thread stack pointer resides with "[thread stack
xxxxxxxx]". xxxxxxxx is the maximum size of the stack. This is valuable
information, because libpthread doesn't set the start of the stack to the
top of the mapped area, depending on the pthread usage.

A sample output of /proc/<pid>/task/<tid>/maps looks like:

08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
a7d12000-a7d13000 ---p 00000000 00:00 0
a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
a7f13000-a7f14000 ---p 00000000 00:00 0
a7f14000-a7f36000 rw-p 00000000 00:00 0
a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
a806c000-a806f000 rw-p 00000000 00:00 0
a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
a8085000-a8088000 rw-p 00000000 00:00 0
a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
which will give you the current stack usage in kB.

A sample output of /proc/self/status looks like:

Name: cat
State: R (running)
Tgid: 507
Pid: 507
.
.
.
CapBnd: fffffffffffffeff
voluntary_ctxt_switches: 0
nonvoluntary_ctxt_switches: 0
Stack usage: 12 kB

I also fixed the stack base address in /proc/<pid>/{task/*,}/stat to be the
base address of the associated thread stack and not the one of the main
process. This makes more sense.

[akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
Signed-off-by: Stefani Seibold
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stefani Seibold
2009-09-23 22:39:41 +0800

cba8aafe1 fs/proc/base.c: fix proc_fault_inject_write() input sanity check

Remove obfuscated zero-length input check and return -EINVAL instead of
-EIO error to make the error message clear to user. Add whitespace
stripping. No functionality changes.

The old code:

echo 1 > /proc/pid/make-it-fail (ok)
echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Input/output error)

The new code:

echo 1 > /proc/pid/make-it-fail (ok)
echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Invalid argument)

This patch is conservative in its changes so as not to break existing
scripts/applications.

Signed-off-by: Vincent Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vincent Li
2009-09-23 22:39:40 +0800

fb92a4b06 fs/proc/task_mmu.c v1: fix clear_refs_write() input sanity check

Andrew Morton pointed out similar string hacking and an obfuscated check for
zero-length input at the end of the function. David Rientjes suggested
using strict_strtol to replace simple_strtol. This patch covers the above
suggestions and adds removal of leading and trailing whitespace from user
input. It does not change the function's behavior.

Signed-off-by: Vincent Li
Acked-by: David Rientjes
Cc: Matt Mackall
Cc: Amerigo Wang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vincent Li
2009-09-23 22:39:40 +0800

acef82b87 kcore: fix /proc/kcore's stat.st_size

In 9063c61fd5cbd ("x86, 64-bit: Clean up user address masking") Linus
fixed the wrong size of /proc/kcore problem.

But its size still looks insane, since it never equals the size of
physical memory.

Signed-off-by: WANG Cong
Cc: "Eric W. Biederman"
Cc: Tao Ma
Cc:
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Amerigo Wang
2009-09-23 22:39:40 +0800

9b4d1cbef proc_flush_task: flush /proc/tid/task/pid when a sub-thread exits

The exiting sub-thread flushes /proc/pid only, but this doesn't buy too
much: ps and friends mostly use /proc/tid/task/pid.

Remove "if (thread_group_leader())" checks from proc_flush_task() path,
this means we always remove /proc/tid/task/pid dentry on exit, and this
actually matches the comment above proc_flush_task().

The test-case:

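#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *tfunc(void *arg)
{
    char name[256];

    /* Touch this thread's /proc entry so its dentry and inode get cached. */
    sprintf(name, "/proc/%d/task/%ld/status", getpid(),
            (long)syscall(SYS_gettid));
    close(open(name, O_RDONLY));

    return NULL;
}

int main(void)
{
    pthread_t t;

    /* Spawn and join short-lived threads forever. */
    for (;;) {
        if (!pthread_create(&t, NULL, &tfunc, NULL))
            pthread_join(t, NULL);
    }
}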

slabtop shows that pid/proc_inode_cache/etc grow quickly and
"indefinitely" until the task is killed or shrink_slab() is called, not
good. And the main thread needs a lot of time to exit.

The same can happen if something like "ps -efL" runs continuously, while
some application spawns short-living threads.

Reported-by: "James M. Leddy"
Signed-off-by: Oleg Nesterov
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Dominic Duval
Cc: Frank Hirtz
Cc: "Fuller, Johnray"
Cc: Larry Woodman
Cc: Paul Batkowski
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2009-09-23 22:39:40 +0800

cff4edb59 proc: fix reported unit for RLIMIT_CPU

/proc/$pid/limits should show RLIMIT_CPU as seconds, which is the unit
used in kernel/posix-cpu-timers.c:

unsigned long psecs = cputime_to_secs(ptime);
...
if (psecs >= sig->rlim[RLIMIT_CPU].rlim_max) {
    ...
    __group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);

Signed-off-by: Kees Cook
Acked-by: WANG Cong
Acked-by: Neil Horman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kees Cook
2009-09-23 22:39:40 +0800

88e9d34c7 seq_file: constify seq_operations

Make all seq_operations structs const, to help mitigate against
revectoring user-triggerable function pointers.

This is derived from the grsecurity patch, although generated from scratch
because it's simpler than extracting the changes from there.

Signed-off-by: James Morris
Acked-by: Serge Hallyn
Acked-by: Casey Schaufler
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

James Morris
2009-09-23 22:39:29 +0800

22 Sep, 2009

3 commits

5d863b896 oom: fix oom_adjust_write() input sanity check

Andrew Morton pointed out that oom_adjust_write() has very strange EIO
and newline handling. This patch fixes it.

Signed-off-by: KOSAKI Motohiro
Cc: Paul Menage
Cc: David Rientjes
Cc: KAMEZAWA Hiroyuki
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-09-22 22:17:39 +0800

495789a51 oom: make oom_score to per-process value

The oom-killer kills a process, not a task, so oom_score should be
calculated per process too. This improves consistency and speeds up
select_bad_process().

Signed-off-by: KOSAKI Motohiro
Cc: Paul Menage
Cc: David Rientjes
Cc: KAMEZAWA Hiroyuki
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-09-22 22:17:39 +0800

28b83c519 oom: move oom_adj value from task_struct to signal_struct

Currently, the OOM logic call flow is as follows:

__out_of_memory()
    select_bad_process()    for each task
        badness()           calculate badness of one task
    oom_kill_process()      search child
        oom_kill_task()     kill target task and mm shared tasks with it

For example, process-A has two threads, thread-A and thread-B, uses a very
large amount of memory, and each thread has the following oom_adj and
oom_score:

thread-A: oom_adj = OOM_DISABLE, oom_score = 0
thread-B: oom_adj = 0, oom_score = very-high

Then select_bad_process() selects thread-B, but oom_kill_task() refuses to
kill the task because thread-A has OOM_DISABLE. Thus __out_of_memory()
calls select_bad_process() again, but select_bad_process() selects the same
task. This means the kernel falls into a livelock.

The fact is that select_bad_process() must select a killable task; otherwise
the OOM logic goes into a livelock.

The root cause is that oom_adj shouldn't be a per-thread value. It should be
a per-process value because the OOM killer kills a process, not a thread. Thus
this patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct signal_struct. This naturally prevents
select_bad_process() from choosing the wrong task.

Signed-off-by: KOSAKI Motohiro
Cc: Paul Menage
Cc: David Rientjes
Cc: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-09-22 22:17:39 +0800