Doug / smarc-fsl-linux-kernel | Embedian Git Server

10 Aug, 2010

2 commits

51b1bd2ac oom: deprecate oom_adj tunable ... Browse Code »

/proc/pid/oom_adj is now deprecated so that that it may eventually be
removed. The target date for removal is August 2012.

A warning will be printed to the kernel log if a task attempts to use this
interface. Future warning will be suppressed until the kernel is rebooted
to prevent spamming the kernel log.

Signed-off-by: David Rientjes
Cc: Nick Piggin
Cc: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Cc: Oleg Nesterov
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2010-08-10 11:45:02 +0800
a63d83f42 oom: badness heuristic rewrite ... Browse Code »

This a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions. The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.

Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space is used instead. This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits. This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.

The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory. "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit. The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500 that the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.

The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs. In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it. It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatability. Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered. The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.

/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning. Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity. This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.

Signed-off-by: David Rientjes
Cc: Nick Piggin
Cc: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Cc: Oleg Nesterov
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2010-08-10 11:45:02 +0800

04 Aug, 2010

1 commit

0ea6e6112 Documentation: update broken web addresses. ... Browse Code »

Below you will find an updated version from the original series bunching all patches into one big patch
updating broken web addresses that are located in Documentation/*
Some of the addresses date as far far back as 1995 etc... so searching became a bit difficult,
the best way to deal with these is to use web.archive.org to locate these addresses that are outdated.
Now there are also some addresses pointing to .spec files some are located, but some(after searching
on the companies site)where still no where to be found. In this case I just changed the address
to the company site this way the users can contact the company and they can locate them for the users.

Signed-off-by: Justin P. Mattock
Signed-off-by: Thomas Weber
Signed-off-by: Mike Frysinger
Cc: Paulo Marques
Cc: Randy Dunlap
Cc: Michael Neuling
Signed-off-by: Jiri Kosina

Justin P. Mattock
2010-08-04 21:21:40 +0800

21 May, 2010

1 commit

f39d01be4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
vlynq: make whole Kconfig-menu dependant on architecture
add descriptive comment for TIF_MEMDIE task flag declaration.
EEPROM: max6875: Header file cleanup
EEPROM: 93cx6: Header file cleanup
EEPROM: Header file cleanup
agp: use NULL instead of 0 when pointer is needed
rtc-v3020: make bitfield unsigned
PCI: make bitfield unsigned
jbd2: use NULL instead of 0 when pointer is needed
cciss: fix shadows sparse warning
doc: inode uses a mutex instead of a semaphore.
uml: i386: Avoid redefinition of NR_syscalls
fix "seperate" typos in comments
cocbalt_lcdfb: correct sections
doc: Change urls for sparse
Powerpc: wii: Fix typo in comment
i2o: cleanup some exit paths
Documentation/: it's -> its where appropriate
UML: Fix compiler warning due to missing task_struct declaration
UML: add kernel.h include to signal.c
...

Linus Torvalds
2010-05-21 00:20:59 +0800

20 May, 2010

1 commit

6e0b7b2c3 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip ... Browse Code »

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
genirq: Clear CPU mask in affinity_hint when none is provided
genirq: Add CPU mask affinity hint
genirq: Remove IRQF_DISABLED from core code
genirq: Run irq handlers with interrupts disabled
genirq: Introduce request_any_context_irq()
genirq: Expose irq_desc->node in proc/irq

Fixed up trivial conflicts in Documentation/feature-removal-schedule.txt

Linus Torvalds
2010-05-20 08:09:40 +0800

12 May, 2010

1 commit

34441427a revert "procfs: provide stack information for threads" and its fixup commits ... Browse Code »

Originally, commit d899bf7b ("procfs: provide stack information for
threads") attempted to introduce a new feature for showing where the
threadstack was located and how many pages are being utilized by the
stack.

Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
applied to fix the NO_MMU case.

Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
64-bit") was applied to fix a bug in ia32 executables being loaded.

Commit 9ebd4eba7 ("procfs: fix /proc//stat stack pointer for kernel
threads") was applied to fix a bug which had kernel threads printing a
userland stack address.

Commit 1306d603f ('proc: partially revert "procfs: provide stack
information for threads"') was then applied to revert the stack pages
being used to solve a significant performance regression.

This patch nearly undoes the effect of all these patches.

The reason for reverting these is it provides an unusable value in
field 28. For x86_64, a fork will result in the task->stack_start
value being updated to the current user top of stack and not the stack
start address. This unpredictability of the stack_start value makes
it worthless. That includes the intended use of showing how much stack
space a thread has.

Other architectures will get different values. As an example, ia64
gets 0. The do_fork() and copy_process() functions appear to treat the
stack_start and stack_size parameters as architecture specific.

I only partially reverted c44972f1 ("procfs: disable per-task stack usage
on NOMMU") . If I had completely reverted it, I would have had to change
mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
configured. Since I could not test the builds without significant effort,
I decided to not change mm/Makefile.

I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
information for threads on 64-bit") . I left the KSTK_ESP() change in
place as that seemed worthwhile.

Signed-off-by: Robin Holt
Cc: Stefani Seibold
Cc: KOSAKI Motohiro
Cc: Michal Simek
Cc: Ingo Molnar
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Robin Holt
2010-05-12 08:33:41 +0800

23 Apr, 2010

1 commit

a33f32244 Documentation/: it's -> its where appropriate ... Browse Code »

Fix obvious cases of "it's" being used when "its" was meant.

Signed-off-by: Francis Galiegue
Acked-by: Randy Dunlap
Signed-off-by: Jiri Kosina

Francis Galiegue
2010-04-23 08:09:52 +0800

24 Mar, 2010

1 commit

92d6b71ab genirq: Expose irq_desc->node in proc/irq ... Browse Code »

Expose irq_desc->node as /proc/irq/*/node.

This file provides device hardware locality information for apps
desiring to include hardware locality in irq mapping decisions.

Signed-off-by: Dimitri Sivanich
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Thomas Gleixner

Dimitri Sivanich
2010-03-24 21:10:03 +0800

15 Mar, 2010

1 commit

32e688b8c Documentation/filesystems/proc.txt typo fix. ... Browse Code »

Typo fix for a filename in procfs documentation.

Signed-off-by: Rob Landley
Signed-off-by: Jiri Kosina

Rob Landley
2010-03-15 22:21:31 +0800

08 Mar, 2010

1 commit

318ae2edc Merge branch 'for-next' into for-linus ... Browse Code »

Conflicts:
Documentation/filesystems/proc.txt
arch/arm/mach-u300/include/mach/debug-macro.S
drivers/net/qlge/qlge_ethtool.c
drivers/net/qlge/qlge_main.c
drivers/net/typhoon.c

Jiri Kosina
2010-03-08 23:55:37 +0800

07 Mar, 2010

3 commits

a1b57ac06 mm: document /proc/pagetypeinfo ... Browse Code »

Add documentation for /proc/pagetypeinfo.

Signed-off-by: Mel Gorman
Reviewed-by: Christoph Lameter
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2010-03-07 03:26:26 +0800
b084d4353 mm: count swap usage ... Browse Code »

A frequent questions from users about memory management is what numbers of
swap ents are user for processes. And this information will give some
hints to oom-killer.

Besides we can count the number of swapents per a process by scanning
/proc//smaps, this is very slow and not good for usual process
information handler which works like 'ps' or 'top'. (ps or top is now
enough slow..)

This patch adds a counter of swapents to mm_counter and update is at each
swap events. Information is exported via /proc//status file as

[kamezawa@bluextal memory]$ cat /proc/self/status
Name: cat
State: R (running)
Tgid: 2910
Pid: 2910
PPid: 2823
TracerPid: 0
Uid: 500 500 500 500
Gid: 500 500 500 500
FDSize: 256
Groups: 500
VmPeak: 82696 kB
VmSize: 82696 kB
VmLck: 0 kB
VmHWM: 432 kB
VmRSS: 432 kB
VmData: 172 kB
VmStk: 84 kB
VmExe: 48 kB
VmLib: 1568 kB
VmPTE: 40 kB
VmSwap: 0 kB
Reviewed-by: Minchan Kim
Reviewed-by: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-07 03:26:24 +0800
34e55232e mm: avoid false sharing of mm_counter ... Browse Code »

Considering the nature of per mm stats, it's the shared object among
threads and can be a cache-miss point in the page fault path.

This patch adds per-thread cache for mm_counter. RSS value will be
counted into a struct in task_struct and synchronized with mm's one at
events.

Now, in this patch, the event is the number of calls to handle_mm_fault.
Per-thread value is added to mm at each 64 calls.

rough estimation with small benchmark on parallel thread (2threads) shows
[before]
4.5 cache-miss/faults
[after]
4.0 cache-miss/faults
Anyway, the most contended object is mmap_sem if the number of threads grows.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Minchan Kim
Cc: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-07 03:26:24 +0800

18 Feb, 2010

1 commit

cb2992a60 doc: typo - Table 1-2 should refer to "status", not "statm" ... Browse Code »

Fixes a typo in proc.txt documentation. Table 1-2 should refer to
"status", not "statm"

Signed-off-by: Mulyadi Santosa
Signed-off-by: Jiri Kosina

Mulyadi Santosa
2010-02-18 07:43:50 +0800

12 Jan, 2010

1 commit

1306d603f proc: partially revert "procfs: provide stack information for threads" ... Browse Code »

Commit d899bf7b (procfs: provide stack information for threads) introduced
to show stack information in /proc/{pid}/status. But it cause large
performance regression. Unfortunately /proc/{pid}/status is used ps
command too and ps is one of most important component. Because both to
take mmap_sem and page table walk are heavily operation.

If many process run, the ps performance is,

[before d899bf7b]

% perf stat ps >/dev/null

Performance counter stats for 'ps':

4090.435806 task-clock-msecs # 0.032 CPUs
229 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
234 page-faults # 0.000 M/sec
8587565207 cycles # 2099.425 M/sec
9866662403 instructions # 1.149 IPC
3789415411 cache-references # 926.409 M/sec
30419509 cache-misses # 7.437 M/sec

128.859521955 seconds time elapsed

[after d899bf7b]

% perf stat ps > /dev/null

Performance counter stats for 'ps':

4305.081146 task-clock-msecs # 0.028 CPUs
480 context-switches # 0.000 M/sec
2 CPU-migrations # 0.000 M/sec
237 page-faults # 0.000 M/sec
9021211334 cycles # 2095.480 M/sec
10605887536 instructions # 1.176 IPC
3612650999 cache-references # 839.160 M/sec
23917502 cache-misses # 5.556 M/sec

152.277819582 seconds time elapsed

Thus, this patch revert it. Fortunately /proc/{pid}/task/{tid}/smaps
provide almost same information. we can use it.

Commit d899bf7b introduced two features:

1) Add the annotattion of [thread stack: xxxx] mark to
/proc/{pid}/task/{tid}/maps.
2) Add StackUsage field to /proc/{pid}/status.

I only revert (2), because I haven't seen (1) cause regression.

Signed-off-by: KOSAKI Motohiro
Cc: Stefani Seibold
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Randy Dunlap
Cc: Andrew Morton
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2010-01-12 01:34:06 +0800

16 Dec, 2009

1 commit

4614a696b procfs: allow threads to rename siblings via /proc/pid/tasks/tid/comm ... Browse Code »

Setting a thread's comm to be something unique is a very useful ability
and is helpful for debugging complicated threaded applications. However
currently the only way to set a thread name is for the thread to name
itself via the PR_SET_NAME prctl.

However, there may be situations where it would be advantageous for a
thread dispatcher to be naming the threads its managing, rather then
having the threads self-describe themselves. This sort of behavior is
available on other systems via the pthread_setname_np() interface.

This patch exports a task's comm via proc/pid/comm and
proc/pid/task/tid/comm interfaces, and allows thread siblings to write to
these values.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: John Stultz
Cc: Andi Kleen
Cc: Arjan van de Ven
Cc: Mike Fulton
Cc: Sean Foley
Cc: Darren Hart
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

john stultz
2009-12-16 00:53:24 +0800

10 Dec, 2009

1 commit

e3cc2226e Doc: better explanation of procs_running ... Browse Code »

the description in Documentation/filesystems/proc.txt of the
procs_running entry in /proc/stat is confusing (according to that
description, it looks as if procs_running could only be a number
between 0 and the number of CPUs).

Changed it to a more accurate description in the patch attached.

Signed-off-by: Randy Dunlap
Signed-off-by: Linus Torvalds

Luis Garces-Erice
2009-12-10 10:59:52 +0800

26 Oct, 2009

1 commit

ce0e7b28f sched, cpuacct: Fix niced guest time accounting ... Browse Code »

CPU time of a guest is always accounted in 'user' time
without concern for the nice value of its counterpart
process although the guest is scheduled under the nice
value.

This patch fixes the defect and accounts cpu time of
a niced guest in 'nice' time as same as a niced process.

And also the patch adds 'guest_nice' to cpuacct. The
value provides niced guest cpu time which is like 'nice'
to 'user'.

The original discussions can be found here:

http://www.mail-archive.com/kvm@vger.kernel.org/msg23982.html
http://www.mail-archive.com/kvm@vger.kernel.org/msg23860.html

Signed-off-by: Ryota Ozaki
Acked-by: Avi Kivity
Cc: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Ryota Ozaki
2009-10-26 00:31:30 +0800

30 Sep, 2009

1 commit

296c355cd ext4: Use tracepoints for mb_history trace file ... Browse Code »

The /proc/fs/ext4//mb_history was maintained manually, and had a
number of problems: it required a largish amount of memory to be
allocated for each ext4 filesystem, and the s_mb_history_lock
introduced a CPU contention problem.

By ripping out the mb_history code and replacing it with ftrace
tracepoints, and we get more functionality: timestamps, event
filtering, the ability to correlate mballoc history with other ext4
tracepoints, etc.

Signed-off-by: "Theodore Ts'o"

Theodore Ts'o
2009-09-30 12:32:42 +0800

23 Sep, 2009

1 commit

d899bf7b5 procfs: provide stack information for threads ... Browse Code »

A patch to give a better overview of the userland application stack usage,
especially for embedded linux.

Currently you are only able to dump the main process/thread stack usage
which is showed in /proc/pid/status by the "VmStk" Value. But you get no
information about the consumed stack memory of the the threads.

There is an enhancement in the /proc//{task/*,}/*maps and which marks
the vm mapping where the thread stack pointer reside with "[thread stack
xxxxxxxx]". xxxxxxxx is the maximum size of stack. This is a value
information, because libpthread doesn't set the start of the stack to the
top of the mapped area, depending of the pthread usage.

A sample output of /proc//task//maps looks like:

08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
a7d12000-a7d13000 ---p 00000000 00:00 0
a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
a7f13000-a7f14000 ---p 00000000 00:00 0
a7f14000-a7f36000 rw-p 00000000 00:00 0
a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
a806c000-a806f000 rw-p 00000000 00:00 0
a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
a8085000-a8088000 rw-p 00000000 00:00 0
a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

Also there is a new entry "stack usage" in /proc//{task/*,}/status
which will you give the current stack usage in kb.

A sample output of /proc/self/status looks like:

Name: cat
State: R (running)
Tgid: 507
Pid: 507
.
.
.
CapBnd: fffffffffffffeff
voluntary_ctxt_switches: 0
nonvoluntary_ctxt_switches: 0
Stack usage: 12 kB

I also fixed stack base address in /proc//{task/*,}/stat to the base
address of the associated thread stack and not the one of the main
process. This makes more sense.

[akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
Signed-off-by: Stefani Seibold
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stefani Seibold
2009-09-23 22:39:41 +0800

22 Sep, 2009

3 commits

495789a51 oom: make oom_score to per-process value ... Browse Code »

oom-killer kills a process, not task. Then oom_score should be calculated
as per-process too. it makes consistency more and makes speed up
select_bad_process().

Signed-off-by: KOSAKI Motohiro
Cc: Paul Menage
Cc: David Rientjes
Cc: KAMEZAWA Hiroyuki
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-09-22 22:17:39 +0800
398499d5f pagemap clear_refs: modify to specify anon or mapped vma clearing ... Browse Code »

The patch makes the clear_refs more versatile in adding the option to
select anonymous pages or file backed pages for clearing. This addition
has a measurable impact on user space application performance as it
decreases the number of pagewalks in scenarios where one is only
interested in a specific type of page (anonymous or file mapped).

The patch adds anonymous and file backed filters to the clear_refs interface.

echo 1 > /proc/PID/clear_refs resets the bits on all pages
echo 2 > /proc/PID/clear_refs resets the bits on anonymous pages only
echo 3 > /proc/PID/clear_refs resets the bits on file backed pages only

Any other value is ignored

Signed-off-by: Moussa A. Ba
Signed-off-by: Jared E. Hulbert
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Moussa A. Ba
2009-09-22 22:17:33 +0800
c574358e8 proc: document `guest' column in /proc/stat ... Browse Code »

We added a new column in cpuX lines of /proc/stat, to show the amount of
time spent by a cpu servicing a guest, without updating
Documentation/filesystems/proc.txt

Signed-off-by: Eric Dumazet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Dumazet
2009-09-22 22:17:24 +0800

19 Aug, 2009

1 commit

0753ba01e mm: revert "oom: move oom_adj value" ... Browse Code »

The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to
the mm_struct. It was a very good first step for sanitize OOM.

However Paul Menage reported the commit makes regression to his job
scheduler. Current OOM logic can kill OOM_DISABLED process.

Why? His program has the code of similar to the following.

...
set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
...
if (vfork() == 0) {
set_oom_adj(0); /* Invoked child can be killed */
execve("foo-bar-cmd");
}
....

vfork() parent and child are shared the same mm_struct. then above
set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also
change oom_adj for vfork() parent. Then, vfork() parent (job scheduler)
lost OOM immune and it was killed.

Actually, fork-setting-exec idiom is very frequently used in userland program.
We must not break this assumption.

Then, this patch revert commit 2ff05b2b and related commit.

Reverted commit list
---------------------
- commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct)
- commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
- commit 8123681022 (oom: only oom kill exiting tasks with attached memory)
- commit 933b787b57 (mm: copy over oom_adj value at fork time)

Signed-off-by: KOSAKI Motohiro
Cc: Paul Menage
Cc: David Rientjes
Cc: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Cc: Linus Torvalds
Cc: Oleg Nesterov
Cc: Nick Piggin
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-08-19 07:31:13 +0800

19 Jun, 2009

2 commits

349888ee7 proc.txt: update kernel filesystem/proc.txt documentation ... Browse Code »

An update for the "Process-Specific Subdirectories" section to reflect the
changes till kernel 2.6.30.

Signed-off-by: Stefani Seibold
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stefani Seibold
2009-06-19 04:03:41 +0800
d3d64df21 proc: export statistics for softirq to /proc ... Browse Code »

Export statistics for softirq in /proc/softirqs and /proc/stat.

1. /proc/softirqs
Implement /proc/softirqs which shows the number of softirq
for each CPU like /proc/interrupts.

2. /proc/stat
Add the "softirq" line to /proc/stat.
This line shows the number of softirq for all cpu.
The first column is the total of all softirqs and
each subsequent column is the total for particular softirq.

[kosaki.motohiro@jp.fujitsu.com: remove redundant for_each_possible_cpu() loop]
Signed-off-by: Keika Kobayashi
Reviewed-by: Hiroshi Shimamoto
Cc: KOSAKI Motohiro
Cc: Ingo Molnar
Cc: Eric Dumazet
Cc: Alexey Dobriyan
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Keika Kobayashi
2009-06-19 04:03:41 +0800

17 Jun, 2009

1 commit

2ff05b2b4 oom: move oom_adj value from task_struct to mm_struct ... Browse Code »

The per-task oom_adj value is a characteristic of its mm more than the
task itself since it's not possible to oom kill any thread that shares the
mm. If a task were to be killed while attached to an mm that could not be
freed because another thread were set to OOM_DISABLE, it would have
needlessly been terminated since there is no potential for future memory
freeing.

This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct mm_struct. This requires task_lock() on a
task to check its oom_adj value to protect against exec, but it's already
necessary to take the lock when dereferencing the mm to find the total VM
size for the badness heuristic.

This fixes a livelock if the oom killer chooses a task and another thread
sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs
because oom_kill_task() repeatedly returns 1 and refuses to kill the
chosen task while select_bad_process() will repeatedly choose the same
task during the next retry.

Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
oom_kill_task() to check for threads sharing the same memory will be
removed in the next patch in this series where it will no longer be
necessary.

Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
these threads are immune from oom killing already. They simply report an
oom_adj value of OOM_DISABLE.

Cc: Nick Piggin
Cc: Rik van Riel
Cc: Mel Gorman
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2009-06-17 10:47:43 +0800

13 Jun, 2009

1 commit

19f594600 trivial: Miscellaneous documentation typo fixes ... Browse Code »

Fix various typos in documentation txts.

Signed-off-by: Matt LaPlante
Signed-off-by: Jiri Kosina

Matt LaPlante
2009-06-13 00:01:47 +0800

03 Apr, 2009

1 commit

760df93ec documentation: update Documentation/filesystem/proc.txt and Documentation/sysctls ... Browse Code »

Now /proc/sys is described in many places and much information is
redundant. This patch updates the proc.txt and move the /proc/sys
desciption out to the files in Documentation/sysctls.

Details are:

merge
- 2.1 /proc/sys/fs - File system data
- 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
- 2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface
with Documentation/sysctls/fs.txt.

remove
- 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
since it's not better then the Documentation/binfmt_misc.txt.

merge
- 2.3 /proc/sys/kernel - general kernel parameters
with Documentation/sysctls/kernel.txt

remove
- 2.5 /proc/sys/dev - Device specific parameters
since it's obsolete the sysfs is used now.

remove
- 2.6 /proc/sys/sunrpc - Remote procedure calls
since it's not better then the Documentation/sysctls/sunrpc.txt

move
- 2.7 /proc/sys/net - Networking stuff
- 2.9 Appletalk
- 2.10 IPX
to newly created Documentation/sysctls/net.txt.

remove
- 2.8 /proc/sys/net/ipv4 - IPV4 settings
since it's not better then the Documentation/networking/ip-sysctl.txt.

add
- Chapter 3 Per-Process Parameters
to descibe /proc//xxx parameters.

Signed-off-by: Shen Feng
Cc: Randy Dunlap
Cc: "David S. Miller"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shen Feng
2009-04-03 10:04:53 +0800

31 Mar, 2009

1 commit

b713a5ec5 ext4: remove /proc tuning knobs ... Browse Code »

Remove tuning knobs in /proc/fs/ext4//*.

Signed-off-by: "Theodore Ts'o"

Theodore Ts'o
2009-03-31 21:11:14 +0800

19 Mar, 2009

1 commit

e9c6a586f net: Document /proc/sys/net/core/netdev_budget ... Browse Code »

The NAPI poll parameter netdev_budget is not documented in
kernel-docs. Since it may have a substantial effect on at least some
network loads, it should be.

Signed-off-by: Stanislaw Gruszka
Signed-off-by: David S. Miller

Stanislaw Gruszka
2009-03-19 09:51:06 +0800

30 Jan, 2009

1 commit

9e9e3cbc6 mm: OOM documentation update ... Browse Code »

Signed-off-by: Evgeniy Polyakov
Acked-by: David Rientjes
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Evgeniy Polyakov
2009-01-30 10:04:43 +0800

16 Jan, 2009

1 commit

db0fb1848 Update of Documentation: vm.txt and proc.txt ... Browse Code »

Update Documentation/sysctl/vm.txt and Documentation/filesystems/proc.txt.
More specifically, the section on /proc/sys/vm in
Documentation/filesystems/proc.txt was removed and a link to
Documentation/sysctl/vm.txt added.

Most of the verbiage from proc.txt was simply moved in vm.txt, with new
addtional text for "swappiness" and "stat_interval".

Signed-off-by: Peter W Morreale
Acked-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter W Morreale
2009-01-16 08:39:35 +0800

08 Jan, 2009

1 commit

a0c9f240a Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc ... Browse Code »

* 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc:
proc: remove write-only variable in proc_pident_lookup()
proc: fix sparse warning
proc: add /proc/*/stack
proc: remove '##' usage
proc: remove useless WARN_ONs
proc: stop using BKL

Linus Torvalds
2009-01-08 04:01:06 +0800

07 Jan, 2009

1 commit

2da02997e mm: add dirty_background_bytes and dirty_bytes sysctls ... Browse Code »

This change introduces two new sysctls to /proc/sys/vm:
dirty_background_bytes and dirty_bytes.

dirty_background_bytes is the counterpart to dirty_background_ratio and
dirty_bytes is the counterpart to dirty_ratio.

With growing memory capacities of individual machines, it's no longer
sufficient to specify dirty thresholds as a percentage of the amount of
dirtyable memory over the entire system.

dirty_background_bytes and dirty_bytes specify quantities of memory, in
bytes, that represent the dirty limits for the entire system. If either
of these values is set, its value represents the amount of dirty memory
that is needed to commence either background or direct writeback.

When a `bytes' or `ratio' file is written, its counterpart becomes a
function of the written value. For example, if dirty_bytes is written to
be 8096, 8K of memory is required to commence direct writeback.
dirty_ratio is then functionally equivalent to 8K / the amount of
dirtyable memory:

dirtyable_memory = free pages + mapped pages + file cache

dirty_background_bytes = dirty_background_ratio * dirtyable_memory
-or-
dirty_background_ratio = dirty_background_bytes / dirtyable_memory

AND

dirty_bytes = dirty_ratio * dirtyable_memory
-or-
dirty_ratio = dirty_bytes / dirtyable_memory

Only one of dirty_background_bytes and dirty_background_ratio may be
specified at a time, and only one of dirty_bytes and dirty_ratio may be
specified. When one sysctl is written, the other appears as 0 when read.

The `bytes' files operate on a page size granularity since dirty limits
are compared with ZVC values, which are in page units.

Prior to this change, the minimum dirty_ratio was 5 as implemented by
get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
written value between 0 and 100. This restriction is maintained, but
dirty_bytes has a lower limit of only one page.

Also prior to this change, the dirty_background_ratio could not equal or
exceed dirty_ratio. This restriction is maintained in addition to
restricting dirty_background_bytes. If either background threshold equals
or exceeds that of the dirty threshold, it is implicitly set to half the
dirty threshold.

Acked-by: Peter Zijlstra
Cc: Dave Chinner
Cc: Christoph Lameter
Signed-off-by: David Rientjes
Cc: Andrea Righi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2009-01-07 07:59:03 +0800

05 Jan, 2009

1 commit

2ec220e27 proc: add /proc/*/stack ... Browse Code »

/proc/*/stack adds the ability to query a task's stack trace. It is more
useful than /proc/*/wchan as it provides full stack trace instead of single
depth. Example output:

$ cat /proc/self/stack
[] save_stack_trace_tsk+0x17/0x35
[] proc_pid_stack+0x4a/0x76
[] proc_single_show+0x4a/0x5e
[] seq_read+0xf3/0x29f
[] vfs_read+0x6d/0x91
[] sys_read+0x3b/0x60
[] syscall_call+0x7/0xb
[] 0xffffffff

[add save_stack_trace_tsk() on mips, ACK Ralf --adobriyan]
Signed-off-by: Ken Chen
Signed-off-by: Ingo Molnar
Signed-off-by: Alexey Dobriyan

Ken Chen
2009-01-05 17:27:44 +0800

23 Dec, 2008

1 commit

fa623d1b0 Merge branches 'x86/apic', 'x86/cleanups', 'x86/cpufeature', 'x86/crashdump', 'x… ... Browse Code »

…86/debug', 'x86/defconfig', 'x86/detect-hyper', 'x86/doc', 'x86/dumpstack', 'x86/early-printk', 'x86/fpu', 'x86/idle', 'x86/io', 'x86/memory-corruption-check', 'x86/microcode', 'x86/mm', 'x86/mtrr', 'x86/nmi-watchdog', 'x86/pat2', 'x86/pci-ioapic-boot-irq-quirks', 'x86/ptrace', 'x86/quirks', 'x86/reboot', 'x86/setup-memory', 'x86/signal', 'x86/sparse-fixes', 'x86/time', 'x86/uv' and 'x86/xen' into x86/core

Ingo Molnar
2008-12-23 23:27:23 +0800

02 Dec, 2008

1 commit

7ef9964e6 epoll: introduce resource usage limits ... Browse Code »

It has been thought that the per-user file descriptors limit would also
limit the resources that a normal user can request via the epoll
interface. Vegard Nossum reported a very simple program (a modified
version attached) that can make a normal user to request a pretty large
amount of kernel memory, well within the its maximum number of fds. To
solve such problem, default limits are now imposed, and /proc based
configuration has been introduced. A new directory has been created,
named /proc/sys/fs/epoll/ and inside there, there are two configuration
points:

max_user_instances = Maximum number of devices - per user

max_user_watches = Maximum number of "watched" fds - per user

The current default for "max_user_watches" limits the memory used by epoll
to store "watches", to 1/32 of the amount of the low RAM. As example, a
256MB 32bit machine, will have "max_user_watches" set to roughly 90000.
That should be enough to not break existing heavy epoll users. The
default value for "max_user_instances" is set to 128, that should be
enough too.

This also changes the userspace, because a new error code can now come out
from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already
listed, so that should be ok.

[akpm@linux-foundation.org: use get_current_user()]
Signed-off-by: Davide Libenzi
Cc: Michael Kerrisk
Cc:
Cc: Cyrill Gorcunov
Reported-by: Vegard Nossum
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davide Libenzi
2008-12-02 11:55:24 +0800

31 Oct, 2008

1 commit

8a1c8eb75 x86, nmi-watchdog: update procfs nmi_watchdog file documentation v2 ... Browse Code »

Impact: improve documentation

This patch updates the /proc/sys/kernel/nmi_watchdog documentation.
Updated: included Randy Dunlap's corrections.

Signed-off-by: Aristeu Rozanski
Acked-by: Randy Dunlap
Signed-off-by: Ingo Molnar

Aristeu Rozanski
2008-10-31 02:07:04 +0800

20 Oct, 2008

1 commit

7a6560e02 documentation: clarify dirty_ratio and dirty_background_ratio description ... Browse Code »

The current documentation of dirty_ratio and dirty_background_ratio is a
bit misleading.

In the documentation we say that they are "a percentage of total system
memory", but the current page writeback policy, intead, is to apply the
percentages to the dirtyable memory, that means free pages + reclaimable
pages.

Better to be more explicit to clarify this concept.

Signed-off-by: Andrea Righi
Signed-off-by: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Righi
2008-10-20 23:52:32 +0800