Eric Lee / smarc-fsl-linux-kernel

15 Jun, 2011

1 commit

09223371d rcu: Use softirq to address performance regression ... Browse Code »

Commit a26ac2455ffcf3(rcu: move TREE_RCU from softirq to kthread)
introduced performance regression. In an AIM7 test, this commit degraded
performance by about 40%.

The commit runs rcu callbacks in a kthread instead of softirq. We observed
high rate of context switch which is caused by this. Out test system has
64 CPUs and HZ is 1000, so we saw more than 64k context switch per second
which is caused by RCU's per-CPU kthread. A trace showed that most of
the time the RCU per-CPU kthread doesn't actually handle any callbacks,
but instead just does a very small amount of work handling grace periods.
This means that RCU's per-CPU kthreads are making the scheduler do quite
a bit of work in order to allow a very small amount of RCU-related
processing to be done.

Alex Shi's analysis determined that this slowdown is due to lock
contention within the scheduler. Unfortunately, as Peter Zijlstra points
out, the scheduler's real-time semantics require global action, which
means that this contention is inherent in real-time scheduling. (Yes,
perhaps someone will come up with a workaround -- otherwise, -rt is not
going to do well on large SMP systems -- but this patch will work around
this issue in the meantime. And "the meantime" might well be forever.)

This patch therefore re-introduces softirq processing to RCU, but only
for core RCU work. RCU callbacks are still executed in kthread context,
so that only a small amount of RCU work runs in softirq context in the
common case. This should minimize ksoftirqd execution, allowing us to
skip boosting of ksoftirqd for CONFIG_RCU_BOOST=y kernels.

Signed-off-by: Shaohua Li
Tested-by: "Alex,Shi"
Signed-off-by: Paul E. McKenney

Shaohua Li
2011-06-15 06:25:39 +0800

25 May, 2011

1 commit

4b060420a bitmap, irq: add smp_affinity_list interface to /proc/irq ... Browse Code »

Manually adjusting the smp_affinity for IRQ's becomes unwieldy when the
cpu count is large.

Setting smp affinity to cpus 256 to 263 would be:

echo 000000ff,00000000,00000000,00000000,00000000,00000000,00000000,00000000 > smp_affinity

instead of:

echo 256-263 > smp_affinity_list

Think about what it looks like for cpus around say, 4088 to 4095.

We already have many alternate "list" interfaces:

/sys/devices/system/cpu/cpuX/indexY/shared_cpu_list
/sys/devices/system/cpu/cpuX/topology/thread_siblings_list
/sys/devices/system/cpu/cpuX/topology/core_siblings_list
/sys/devices/system/node/nodeX/cpulist
/sys/devices/pci***/***/local_cpulist

Add a companion interface, smp_affinity_list to use cpu lists instead of
cpu maps. This conforms to other companion interfaces where both a map
and a list interface exists.

This required adding a bitmap_parselist_user() function in a manner
similar to the bitmap_parse_user() function.

[akpm@linux-foundation.org: make __bitmap_parselist() static]
Signed-off-by: Mike Travis
Cc: Thomas Gleixner
Cc: Jack Steiner
Cc: Lee Schermerhorn
Cc: Andy Shevchenko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Travis
2011-05-25 23:39:45 +0800

06 May, 2011

1 commit

a26ac2455 rcu: move TREE_RCU from softirq to kthread ... Browse Code »

If RCU priority boosting is to be meaningful, callback invocation must
be boosted in addition to preempted RCU readers. Otherwise, in presence
of CPU real-time threads, the grace period ends, but the callbacks don't
get invoked. If the callbacks don't get invoked, the associated memory
doesn't get freed, so the system is still subject to OOM.

But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
moves the callback invocations to a kthread, which can be boosted easily.

Also add comments and properly synchronized all accesses to
rcu_cpu_kthread_task, as suggested by Lai Jiangshan.

Signed-off-by: Paul E. McKenney
Signed-off-by: Paul E. McKenney
Reviewed-by: Josh Triplett

Paul E. McKenney
2011-05-06 14:16:54 +0800

31 Mar, 2011

1 commit

25985edce Fix common misspellings ... Browse Code »

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi

Lucas De Marchi
2011-03-31 22:26:23 +0800

14 Jan, 2011

2 commits

dabb16f63 oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down ... Browse Code »

We'd like to be able to oom_score_adj a process up/down as it
enters/leaves the foreground. Currently, it is not possible to oom_adj
down without CAP_SYS_RESOURCE. This patch allows a task to decrease its
oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
or its inherited value at fork. Assuming the thread that has forked it
has oom_score_adj of 0, each process could decrease it back from 0 upon
activation unless a CAP_SYS_RESOURCE thread elevated it to something
higher.

Alternative considered:

* a setuid binary
* a daemon with CAP_SYS_RESOURCE

Since you don't wan't all processes to be able to reduce their oom_adj, a
setuid or daemon implementation would be complex. The alternatives also
have much higher overhead.

This patch updated from original patch based on feedback from David
Rientjes.

Signed-off-by: Mandeep Singh Baines
Acked-by: David Rientjes
Cc: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Cc: Rik van Riel
Cc: Ying Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mandeep Singh Baines
2011-01-14 09:32:35 +0800
2d90508f6 mm: smaps: export mlock information ... Browse Code »

Currently there is no way to find whether a process has locked its pages
in memory or not. And which of the memory regions are locked in memory.

Add a new field "Locked" to export this information via the smaps file.

Signed-off-by: Nikanth Karthikesan
Acked-by: Balbir Singh
Acked-by: Wu Fengguang
Cc: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nikanth Karthikesan
2011-01-14 09:32:33 +0800

17 Nov, 2010

1 commit

23308ba54 console: add /proc/consoles ... Browse Code »

It allows users to see what consoles are currently known to the system
and with what flags.

It is based on Werner's patch, the part about traversing fds was
removed, the code was moved to kernel/printk.c, where consoles are
handled and it makes more sense to me.

Signed-off-by: Jiri Slaby [cleanups]
Signed-off-by: "Dr. Werner Fink"
Cc: Al Viro
Cc: Greg Kroah-Hartman
Signed-off-by: Greg Kroah-Hartman

Jiri Slaby
2010-11-17 04:50:17 +0800

28 Oct, 2010

2 commits

03f890f8c /proc/pid/pagemap: document in Documentation/filesystems/proc.txt ... Browse Code »

Document /proc/pid/pagemap in Documentation/filesystems/proc.txt

Signed-off-by: Nikanth Karthikesan
Cc: Richard Guenther
Cc: Balbir Singh
Cc: KOSAKI Motohiro
Acked-by: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nikanth Karthikesan
2010-10-28 09:03:13 +0800
b40d4f84b /proc/pid/smaps: export amount of anonymous memory in a mapping ... Browse Code »

Export the number of anonymous pages in a mapping via smaps.

Even the private pages in a mapping backed by a file, would be marked as
anonymous, when they are modified. Export this information to user-space via
smaps.

Exporting this count will help gdb to make a better decision on which
areas need to be dumped in its coredump; and should be useful to others
studying the memory usage of a process.

Signed-off-by: Nikanth Karthikesan
Acked-by: Hugh Dickins
Reviewed-by: KOSAKI Motohiro
Cc: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nikanth Karthikesan
2010-10-28 09:03:13 +0800

27 Oct, 2010

1 commit

0f4d208f1 Documentation/filesystems/proc.txt: improve smaps field documentation ... Browse Code »

Signed-off-by: Matt Mackall
Cc: Nikanth Karthikesan
Cc: Balbir Singh
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Matt Mackall
2010-10-27 07:52:05 +0800

23 Oct, 2010

2 commits

6c2754c28 Revert "tty: Add a new file /proc/tty/consoles" ... Browse Code »

This reverts commit f4a3e0bceb57466c31757f25e4e0ed108d1299ec. Jiri
Sladby points out that the tty structure we're using may already be
gone, and Al Viro doesn't hold back in complaining about the random
loading of 'filp->private_data' which doesn't have to be a pointer at
all, nor does checking the magic field for TTY_MAGIC prove anything.

Belated review by Al:

"a) global variable depending on stdin of the last opener? Affecting
output of read(2)? Really?

b) iterator is broken; list should be locked in ->start(), unlocked in
->stop() and *NOT* unlocked/relocked in ->next()

c) ->show() ought to do nothing in case of ->device == NULL, instead
of skipping those in ->next()/->start()

d) regardless of the merits of the bright idea about asterisk at that
line in output *and* regardless of (a), the implementation is not
only atrociously ugly, it's actually very likely to be a roothole.
Verifying that Cthulhu knows what number happens to be address of a
tty_struct by blindly dereferencing memory at that address...
Ouch.

Please revert that crap."

And Christoph pipes in and NAK's the approach of walking fd tables etc
too. So it's pretty unanimous.

Noticed-by: Jri Slaby
Requested-by: Al Viro
Cc: Greg Kroah-Hartman
Cc: Werner Fink
Cc: Alan Cox
Cc: Christoph Hellwig
Signed-off-by: Linus Torvalds

Linus Torvalds
2010-10-23 23:14:12 +0800
f4a3e0bce tty: Add a new file /proc/tty/consoles ... Browse Code »

Add a new file /proc/tty/consoles to be able to determine the registered
system console lines. If the reading process holds /dev/console open at
the regular standard input stream the active device will be marked by an
asterisk. Show possible operations and also decode the used flags of
the listed console lines.

Signed-off-by: Werner Fink
Cc: Alan Cox
Signed-off-by: Greg Kroah-Hartman

Dr. Werner Fink
2010-10-23 01:20:05 +0800

10 Aug, 2010

2 commits

51b1bd2ac oom: deprecate oom_adj tunable ... Browse Code »

/proc/pid/oom_adj is now deprecated so that that it may eventually be
removed. The target date for removal is August 2012.

A warning will be printed to the kernel log if a task attempts to use this
interface. Future warning will be suppressed until the kernel is rebooted
to prevent spamming the kernel log.

Signed-off-by: David Rientjes
Cc: Nick Piggin
Cc: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Cc: Oleg Nesterov
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2010-08-10 11:45:02 +0800
a63d83f42 oom: badness heuristic rewrite ... Browse Code »

This a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions. The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.

Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space is used instead. This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits. This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.

The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory. "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit. The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500 that the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.

The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs. In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it. It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatability. Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered. The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.

/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning. Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity. This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.

Signed-off-by: David Rientjes
Cc: Nick Piggin
Cc: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Cc: Oleg Nesterov
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2010-08-10 11:45:02 +0800

04 Aug, 2010

1 commit

0ea6e6112 Documentation: update broken web addresses. ... Browse Code »

Below you will find an updated version from the original series bunching all patches into one big patch
updating broken web addresses that are located in Documentation/*
Some of the addresses date as far far back as 1995 etc... so searching became a bit difficult,
the best way to deal with these is to use web.archive.org to locate these addresses that are outdated.
Now there are also some addresses pointing to .spec files some are located, but some(after searching
on the companies site)where still no where to be found. In this case I just changed the address
to the company site this way the users can contact the company and they can locate them for the users.

Signed-off-by: Justin P. Mattock
Signed-off-by: Thomas Weber
Signed-off-by: Mike Frysinger
Cc: Paulo Marques
Cc: Randy Dunlap
Cc: Michael Neuling
Signed-off-by: Jiri Kosina

Justin P. Mattock
2010-08-04 21:21:40 +0800

21 May, 2010

1 commit

f39d01be4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
vlynq: make whole Kconfig-menu dependant on architecture
add descriptive comment for TIF_MEMDIE task flag declaration.
EEPROM: max6875: Header file cleanup
EEPROM: 93cx6: Header file cleanup
EEPROM: Header file cleanup
agp: use NULL instead of 0 when pointer is needed
rtc-v3020: make bitfield unsigned
PCI: make bitfield unsigned
jbd2: use NULL instead of 0 when pointer is needed
cciss: fix shadows sparse warning
doc: inode uses a mutex instead of a semaphore.
uml: i386: Avoid redefinition of NR_syscalls
fix "seperate" typos in comments
cocbalt_lcdfb: correct sections
doc: Change urls for sparse
Powerpc: wii: Fix typo in comment
i2o: cleanup some exit paths
Documentation/: it's -> its where appropriate
UML: Fix compiler warning due to missing task_struct declaration
UML: add kernel.h include to signal.c
...

Linus Torvalds
2010-05-21 00:20:59 +0800

20 May, 2010

1 commit

6e0b7b2c3 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip ... Browse Code »

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
genirq: Clear CPU mask in affinity_hint when none is provided
genirq: Add CPU mask affinity hint
genirq: Remove IRQF_DISABLED from core code
genirq: Run irq handlers with interrupts disabled
genirq: Introduce request_any_context_irq()
genirq: Expose irq_desc->node in proc/irq

Fixed up trivial conflicts in Documentation/feature-removal-schedule.txt

Linus Torvalds
2010-05-20 08:09:40 +0800

12 May, 2010

1 commit

34441427a revert "procfs: provide stack information for threads" and its fixup commits ... Browse Code »

Originally, commit d899bf7b ("procfs: provide stack information for
threads") attempted to introduce a new feature for showing where the
threadstack was located and how many pages are being utilized by the
stack.

Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
applied to fix the NO_MMU case.

Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
64-bit") was applied to fix a bug in ia32 executables being loaded.

Commit 9ebd4eba7 ("procfs: fix /proc//stat stack pointer for kernel
threads") was applied to fix a bug which had kernel threads printing a
userland stack address.

Commit 1306d603f ('proc: partially revert "procfs: provide stack
information for threads"') was then applied to revert the stack pages
being used to solve a significant performance regression.

This patch nearly undoes the effect of all these patches.

The reason for reverting these is it provides an unusable value in
field 28. For x86_64, a fork will result in the task->stack_start
value being updated to the current user top of stack and not the stack
start address. This unpredictability of the stack_start value makes
it worthless. That includes the intended use of showing how much stack
space a thread has.

Other architectures will get different values. As an example, ia64
gets 0. The do_fork() and copy_process() functions appear to treat the
stack_start and stack_size parameters as architecture specific.

I only partially reverted c44972f1 ("procfs: disable per-task stack usage
on NOMMU") . If I had completely reverted it, I would have had to change
mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
configured. Since I could not test the builds without significant effort,
I decided to not change mm/Makefile.

I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
information for threads on 64-bit") . I left the KSTK_ESP() change in
place as that seemed worthwhile.

Signed-off-by: Robin Holt
Cc: Stefani Seibold
Cc: KOSAKI Motohiro
Cc: Michal Simek
Cc: Ingo Molnar
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Robin Holt
2010-05-12 08:33:41 +0800

23 Apr, 2010

1 commit

a33f32244 Documentation/: it's -> its where appropriate ... Browse Code »

Fix obvious cases of "it's" being used when "its" was meant.

Signed-off-by: Francis Galiegue
Acked-by: Randy Dunlap
Signed-off-by: Jiri Kosina

Francis Galiegue
2010-04-23 08:09:52 +0800

24 Mar, 2010

1 commit

92d6b71ab genirq: Expose irq_desc->node in proc/irq ... Browse Code »

Expose irq_desc->node as /proc/irq/*/node.

This file provides device hardware locality information for apps
desiring to include hardware locality in irq mapping decisions.

Signed-off-by: Dimitri Sivanich
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Thomas Gleixner

Dimitri Sivanich
2010-03-24 21:10:03 +0800

15 Mar, 2010

1 commit

32e688b8c Documentation/filesystems/proc.txt typo fix. ... Browse Code »

Typo fix for a filename in procfs documentation.

Signed-off-by: Rob Landley
Signed-off-by: Jiri Kosina

Rob Landley
2010-03-15 22:21:31 +0800

08 Mar, 2010

1 commit

318ae2edc Merge branch 'for-next' into for-linus ... Browse Code »

Conflicts:
Documentation/filesystems/proc.txt
arch/arm/mach-u300/include/mach/debug-macro.S
drivers/net/qlge/qlge_ethtool.c
drivers/net/qlge/qlge_main.c
drivers/net/typhoon.c

Jiri Kosina
2010-03-08 23:55:37 +0800

07 Mar, 2010

3 commits

a1b57ac06 mm: document /proc/pagetypeinfo ... Browse Code »

Add documentation for /proc/pagetypeinfo.

Signed-off-by: Mel Gorman
Reviewed-by: Christoph Lameter
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2010-03-07 03:26:26 +0800
b084d4353 mm: count swap usage ... Browse Code »

A frequent questions from users about memory management is what numbers of
swap ents are user for processes. And this information will give some
hints to oom-killer.

Besides we can count the number of swapents per a process by scanning
/proc//smaps, this is very slow and not good for usual process
information handler which works like 'ps' or 'top'. (ps or top is now
enough slow..)

This patch adds a counter of swapents to mm_counter and update is at each
swap events. Information is exported via /proc//status file as

[kamezawa@bluextal memory]$ cat /proc/self/status
Name: cat
State: R (running)
Tgid: 2910
Pid: 2910
PPid: 2823
TracerPid: 0
Uid: 500 500 500 500
Gid: 500 500 500 500
FDSize: 256
Groups: 500
VmPeak: 82696 kB
VmSize: 82696 kB
VmLck: 0 kB
VmHWM: 432 kB
VmRSS: 432 kB
VmData: 172 kB
VmStk: 84 kB
VmExe: 48 kB
VmLib: 1568 kB
VmPTE: 40 kB
VmSwap: 0 kB
Reviewed-by: Minchan Kim
Reviewed-by: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-07 03:26:24 +0800
34e55232e mm: avoid false sharing of mm_counter ... Browse Code »

Considering the nature of per mm stats, it's the shared object among
threads and can be a cache-miss point in the page fault path.

This patch adds per-thread cache for mm_counter. RSS value will be
counted into a struct in task_struct and synchronized with mm's one at
events.

Now, in this patch, the event is the number of calls to handle_mm_fault.
Per-thread value is added to mm at each 64 calls.

rough estimation with small benchmark on parallel thread (2threads) shows
[before]
4.5 cache-miss/faults
[after]
4.0 cache-miss/faults
Anyway, the most contended object is mmap_sem if the number of threads grows.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Minchan Kim
Cc: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-07 03:26:24 +0800

18 Feb, 2010

1 commit

cb2992a60 doc: typo - Table 1-2 should refer to "status", not "statm" ... Browse Code »

Fixes a typo in proc.txt documentation. Table 1-2 should refer to
"status", not "statm"

Signed-off-by: Mulyadi Santosa
Signed-off-by: Jiri Kosina

Mulyadi Santosa
2010-02-18 07:43:50 +0800

12 Jan, 2010

1 commit

1306d603f proc: partially revert "procfs: provide stack information for threads" ... Browse Code »

Commit d899bf7b (procfs: provide stack information for threads) introduced
to show stack information in /proc/{pid}/status. But it cause large
performance regression. Unfortunately /proc/{pid}/status is used ps
command too and ps is one of most important component. Because both to
take mmap_sem and page table walk are heavily operation.

If many process run, the ps performance is,

[before d899bf7b]

% perf stat ps >/dev/null

Performance counter stats for 'ps':

4090.435806 task-clock-msecs # 0.032 CPUs
229 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
234 page-faults # 0.000 M/sec
8587565207 cycles # 2099.425 M/sec
9866662403 instructions # 1.149 IPC
3789415411 cache-references # 926.409 M/sec
30419509 cache-misses # 7.437 M/sec

128.859521955 seconds time elapsed

[after d899bf7b]

% perf stat ps > /dev/null

Performance counter stats for 'ps':

4305.081146 task-clock-msecs # 0.028 CPUs
480 context-switches # 0.000 M/sec
2 CPU-migrations # 0.000 M/sec
237 page-faults # 0.000 M/sec
9021211334 cycles # 2095.480 M/sec
10605887536 instructions # 1.176 IPC
3612650999 cache-references # 839.160 M/sec
23917502 cache-misses # 5.556 M/sec

152.277819582 seconds time elapsed

Thus, this patch revert it. Fortunately /proc/{pid}/task/{tid}/smaps
provide almost same information. we can use it.

Commit d899bf7b introduced two features:

1) Add the annotattion of [thread stack: xxxx] mark to
/proc/{pid}/task/{tid}/maps.
2) Add StackUsage field to /proc/{pid}/status.

I only revert (2), because I haven't seen (1) cause regression.

Signed-off-by: KOSAKI Motohiro
Cc: Stefani Seibold
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Randy Dunlap
Cc: Andrew Morton
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2010-01-12 01:34:06 +0800

16 Dec, 2009

1 commit

4614a696b procfs: allow threads to rename siblings via /proc/pid/tasks/tid/comm ... Browse Code »

Setting a thread's comm to be something unique is a very useful ability
and is helpful for debugging complicated threaded applications. However
currently the only way to set a thread name is for the thread to name
itself via the PR_SET_NAME prctl.

However, there may be situations where it would be advantageous for a
thread dispatcher to be naming the threads its managing, rather then
having the threads self-describe themselves. This sort of behavior is
available on other systems via the pthread_setname_np() interface.

This patch exports a task's comm via proc/pid/comm and
proc/pid/task/tid/comm interfaces, and allows thread siblings to write to
these values.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: John Stultz
Cc: Andi Kleen
Cc: Arjan van de Ven
Cc: Mike Fulton
Cc: Sean Foley
Cc: Darren Hart
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

john stultz
2009-12-16 00:53:24 +0800

10 Dec, 2009

1 commit

e3cc2226e Doc: better explanation of procs_running ... Browse Code »

the description in Documentation/filesystems/proc.txt of the
procs_running entry in /proc/stat is confusing (according to that
description, it looks as if procs_running could only be a number
between 0 and the number of CPUs).

Changed it to a more accurate description in the patch attached.

Signed-off-by: Randy Dunlap
Signed-off-by: Linus Torvalds

Luis Garces-Erice
2009-12-10 10:59:52 +0800

26 Oct, 2009

1 commit

ce0e7b28f sched, cpuacct: Fix niced guest time accounting ... Browse Code »

CPU time of a guest is always accounted in 'user' time
without concern for the nice value of its counterpart
process although the guest is scheduled under the nice
value.

This patch fixes the defect and accounts cpu time of
a niced guest in 'nice' time as same as a niced process.

And also the patch adds 'guest_nice' to cpuacct. The
value provides niced guest cpu time which is like 'nice'
to 'user'.

The original discussions can be found here:

http://www.mail-archive.com/kvm@vger.kernel.org/msg23982.html
http://www.mail-archive.com/kvm@vger.kernel.org/msg23860.html

Signed-off-by: Ryota Ozaki
Acked-by: Avi Kivity
Cc: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Ryota Ozaki
2009-10-26 00:31:30 +0800

30 Sep, 2009

1 commit

296c355cd ext4: Use tracepoints for mb_history trace file ... Browse Code »

The /proc/fs/ext4//mb_history was maintained manually, and had a
number of problems: it required a largish amount of memory to be
allocated for each ext4 filesystem, and the s_mb_history_lock
introduced a CPU contention problem.

By ripping out the mb_history code and replacing it with ftrace
tracepoints, and we get more functionality: timestamps, event
filtering, the ability to correlate mballoc history with other ext4
tracepoints, etc.

Signed-off-by: "Theodore Ts'o"

Theodore Ts'o
2009-09-30 12:32:42 +0800

23 Sep, 2009

1 commit

d899bf7b5 procfs: provide stack information for threads ... Browse Code »

A patch to give a better overview of the userland application stack usage,
especially for embedded linux.

Currently you are only able to dump the main process/thread stack usage
which is showed in /proc/pid/status by the "VmStk" Value. But you get no
information about the consumed stack memory of the the threads.

There is an enhancement in the /proc//{task/*,}/*maps and which marks
the vm mapping where the thread stack pointer reside with "[thread stack
xxxxxxxx]". xxxxxxxx is the maximum size of stack. This is a value
information, because libpthread doesn't set the start of the stack to the
top of the mapped area, depending of the pthread usage.

A sample output of /proc//task//maps looks like:

08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
a7d12000-a7d13000 ---p 00000000 00:00 0
a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
a7f13000-a7f14000 ---p 00000000 00:00 0
a7f14000-a7f36000 rw-p 00000000 00:00 0
a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
a806c000-a806f000 rw-p 00000000 00:00 0
a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
a8085000-a8088000 rw-p 00000000 00:00 0
a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

Also there is a new entry "stack usage" in /proc//{task/*,}/status
which will you give the current stack usage in kb.

A sample output of /proc/self/status looks like:

Name: cat
State: R (running)
Tgid: 507
Pid: 507
.
.
.
CapBnd: fffffffffffffeff
voluntary_ctxt_switches: 0
nonvoluntary_ctxt_switches: 0
Stack usage: 12 kB

I also fixed stack base address in /proc//{task/*,}/stat to the base
address of the associated thread stack and not the one of the main
process. This makes more sense.

[akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
Signed-off-by: Stefani Seibold
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stefani Seibold
2009-09-23 22:39:41 +0800

22 Sep, 2009

3 commits

495789a51 oom: make oom_score to per-process value ... Browse Code »

oom-killer kills a process, not task. Then oom_score should be calculated
as per-process too. it makes consistency more and makes speed up
select_bad_process().

Signed-off-by: KOSAKI Motohiro
Cc: Paul Menage
Cc: David Rientjes
Cc: KAMEZAWA Hiroyuki
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-09-22 22:17:39 +0800
398499d5f pagemap clear_refs: modify to specify anon or mapped vma clearing ... Browse Code »

The patch makes the clear_refs more versatile in adding the option to
select anonymous pages or file backed pages for clearing. This addition
has a measurable impact on user space application performance as it
decreases the number of pagewalks in scenarios where one is only
interested in a specific type of page (anonymous or file mapped).

The patch adds anonymous and file backed filters to the clear_refs interface.

echo 1 > /proc/PID/clear_refs resets the bits on all pages
echo 2 > /proc/PID/clear_refs resets the bits on anonymous pages only
echo 3 > /proc/PID/clear_refs resets the bits on file backed pages only

Any other value is ignored

Signed-off-by: Moussa A. Ba
Signed-off-by: Jared E. Hulbert
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Moussa A. Ba
2009-09-22 22:17:33 +0800
c574358e8 proc: document `guest' column in /proc/stat ... Browse Code »

We added a new column in cpuX lines of /proc/stat, to show the amount of
time spent by a cpu servicing a guest, without updating
Documentation/filesystems/proc.txt

Signed-off-by: Eric Dumazet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Dumazet
2009-09-22 22:17:24 +0800

19 Aug, 2009

1 commit

0753ba01e mm: revert "oom: move oom_adj value" ... Browse Code »

The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to
the mm_struct. It was a very good first step for sanitize OOM.

However Paul Menage reported the commit makes regression to his job
scheduler. Current OOM logic can kill OOM_DISABLED process.

Why? His program has the code of similar to the following.

...
set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
...
if (vfork() == 0) {
set_oom_adj(0); /* Invoked child can be killed */
execve("foo-bar-cmd");
}
....

vfork() parent and child are shared the same mm_struct. then above
set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also
change oom_adj for vfork() parent. Then, vfork() parent (job scheduler)
lost OOM immune and it was killed.

Actually, fork-setting-exec idiom is very frequently used in userland program.
We must not break this assumption.

Then, this patch revert commit 2ff05b2b and related commit.

Reverted commit list
---------------------
- commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct)
- commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
- commit 8123681022 (oom: only oom kill exiting tasks with attached memory)
- commit 933b787b57 (mm: copy over oom_adj value at fork time)

Signed-off-by: KOSAKI Motohiro
Cc: Paul Menage
Cc: David Rientjes
Cc: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Cc: Linus Torvalds
Cc: Oleg Nesterov
Cc: Nick Piggin
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-08-19 07:31:13 +0800

19 Jun, 2009

2 commits

349888ee7 proc.txt: update kernel filesystem/proc.txt documentation ... Browse Code »

An update for the "Process-Specific Subdirectories" section to reflect the
changes till kernel 2.6.30.

Signed-off-by: Stefani Seibold
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stefani Seibold
2009-06-19 04:03:41 +0800
d3d64df21 proc: export statistics for softirq to /proc ... Browse Code »

Export statistics for softirq in /proc/softirqs and /proc/stat.

1. /proc/softirqs
Implement /proc/softirqs which shows the number of softirq
for each CPU like /proc/interrupts.

2. /proc/stat
Add the "softirq" line to /proc/stat.
This line shows the number of softirq for all cpu.
The first column is the total of all softirqs and
each subsequent column is the total for particular softirq.

[kosaki.motohiro@jp.fujitsu.com: remove redundant for_each_possible_cpu() loop]
Signed-off-by: Keika Kobayashi
Reviewed-by: Hiroshi Shimamoto
Cc: KOSAKI Motohiro
Cc: Ingo Molnar
Cc: Eric Dumazet
Cc: Alexey Dobriyan
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Keika Kobayashi
2009-06-19 04:03:41 +0800

17 Jun, 2009

1 commit

2ff05b2b4 oom: move oom_adj value from task_struct to mm_struct ... Browse Code »

The per-task oom_adj value is a characteristic of its mm more than the
task itself since it's not possible to oom kill any thread that shares the
mm. If a task were to be killed while attached to an mm that could not be
freed because another thread were set to OOM_DISABLE, it would have
needlessly been terminated since there is no potential for future memory
freeing.

This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct mm_struct. This requires task_lock() on a
task to check its oom_adj value to protect against exec, but it's already
necessary to take the lock when dereferencing the mm to find the total VM
size for the badness heuristic.

This fixes a livelock if the oom killer chooses a task and another thread
sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs
because oom_kill_task() repeatedly returns 1 and refuses to kill the
chosen task while select_bad_process() will repeatedly choose the same
task during the next retry.

Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
oom_kill_task() to check for threads sharing the same memory will be
removed in the next patch in this series where it will no longer be
necessary.

Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
these threads are immune from oom killing already. They simply report an
oom_adj value of OOM_DISABLE.

Cc: Nick Piggin
Cc: Rik van Riel
Cc: Mel Gorman
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2009-06-17 10:47:43 +0800

13 Jun, 2009

1 commit

19f594600 trivial: Miscellaneous documentation typo fixes ... Browse Code »

Fix various typos in documentation txts.

Signed-off-by: Matt LaPlante
Signed-off-by: Jiri Kosina

Matt LaPlante
2009-06-13 00:01:47 +0800