10 Aug, 2010

2 commits

  • /proc/pid/oom_adj is now deprecated so that that it may eventually be
    removed. The target date for removal is August 2012.

    A warning will be printed to the kernel log if a task attempts to use this
    interface. Future warning will be suppressed until the kernel is rebooted
    to prevent spamming the kernel log.

    Signed-off-by: David Rientjes
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This a complete rewrite of the oom killer's badness() heuristic which is
    used to determine which task to kill in oom conditions. The goal is to
    make it as simple and predictable as possible so the results are better
    understood and we end up killing the task which will lead to the most
    memory freeing while still respecting the fine-tuning from userspace.

    Instead of basing the heuristic on mm->total_vm for each task, the task's
    rss and swap space is used instead. This is a better indication of the
    amount of memory that will be freeable if the oom killed task is chosen
    and subsequently exits. This helps specifically in cases where KDE or
    GNOME is chosen for oom kill on desktop systems instead of a memory
    hogging task.

    The baseline for the heuristic is a proportion of memory that each task is
    currently using in memory plus swap compared to the amount of "allowable"
    memory. "Allowable," in this sense, means the system-wide resources for
    unconstrained oom conditions, the set of mempolicy nodes, the mems
    attached to current's cpuset, or a memory controller's limit. The
    proportion is given on a scale of 0 (never kill) to 1000 (always kill),
    roughly meaning that if a task has a badness() score of 500 that the task
    consumes approximately 50% of allowable memory resident in RAM or in swap
    space.

    The proportion is always relative to the amount of "allowable" memory and
    not the total amount of RAM systemwide so that mempolicies and cpusets may
    operate in isolation; they shall not need to know the true size of the
    machine on which they are running if they are bound to a specific set of
    nodes or mems, respectively.

    Root tasks are given 3% extra memory just like __vm_enough_memory()
    provides in LSMs. In the event of two tasks consuming similar amounts of
    memory, it is generally better to save root's task.

    Because of the change in the badness() heuristic's baseline, it is also
    necessary to introduce a new user interface to tune it. It's not possible
    to redefine the meaning of /proc/pid/oom_adj with a new scale since the
    ABI cannot be changed for backward compatability. Instead, a new tunable,
    /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
    be used to polarize the heuristic such that certain tasks are never
    considered for oom kill while others may always be considered. The value
    is added directly into the badness() score so a value of -500, for
    example, means to discount 50% of its memory consumption in comparison to
    other tasks either on the system, bound to the mempolicy, in the cpuset,
    or sharing the same memory controller.

    /proc/pid/oom_adj is changed so that its meaning is rescaled into the
    units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
    these per-task tunables will rescale the value of the other to an
    equivalent meaning. Although /proc/pid/oom_adj was originally defined as
    a bitshift on the badness score, it now shares the same linear growth as
    /proc/pid/oom_score_adj but with different granularity. This is required
    so the ABI is not broken with userspace applications and allows oom_adj to
    be deprecated for future removal.

    Signed-off-by: David Rientjes
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

04 Aug, 2010

1 commit

  • Below you will find an updated version from the original series bunching all patches into one big patch
    updating broken web addresses that are located in Documentation/*
    Some of the addresses date as far far back as 1995 etc... so searching became a bit difficult,
    the best way to deal with these is to use web.archive.org to locate these addresses that are outdated.
    Now there are also some addresses pointing to .spec files some are located, but some(after searching
    on the companies site)where still no where to be found. In this case I just changed the address
    to the company site this way the users can contact the company and they can locate them for the users.

    Signed-off-by: Justin P. Mattock
    Signed-off-by: Thomas Weber
    Signed-off-by: Mike Frysinger
    Cc: Paulo Marques
    Cc: Randy Dunlap
    Cc: Michael Neuling
    Signed-off-by: Jiri Kosina

    Justin P. Mattock
     

21 May, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
    vlynq: make whole Kconfig-menu dependant on architecture
    add descriptive comment for TIF_MEMDIE task flag declaration.
    EEPROM: max6875: Header file cleanup
    EEPROM: 93cx6: Header file cleanup
    EEPROM: Header file cleanup
    agp: use NULL instead of 0 when pointer is needed
    rtc-v3020: make bitfield unsigned
    PCI: make bitfield unsigned
    jbd2: use NULL instead of 0 when pointer is needed
    cciss: fix shadows sparse warning
    doc: inode uses a mutex instead of a semaphore.
    uml: i386: Avoid redefinition of NR_syscalls
    fix "seperate" typos in comments
    cocbalt_lcdfb: correct sections
    doc: Change urls for sparse
    Powerpc: wii: Fix typo in comment
    i2o: cleanup some exit paths
    Documentation/: it's -> its where appropriate
    UML: Fix compiler warning due to missing task_struct declaration
    UML: add kernel.h include to signal.c
    ...

    Linus Torvalds
     

20 May, 2010

1 commit


12 May, 2010

1 commit

  • Originally, commit d899bf7b ("procfs: provide stack information for
    threads") attempted to introduce a new feature for showing where the
    threadstack was located and how many pages are being utilized by the
    stack.

    Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
    applied to fix the NO_MMU case.

    Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
    64-bit") was applied to fix a bug in ia32 executables being loaded.

    Commit 9ebd4eba7 ("procfs: fix /proc//stat stack pointer for kernel
    threads") was applied to fix a bug which had kernel threads printing a
    userland stack address.

    Commit 1306d603f ('proc: partially revert "procfs: provide stack
    information for threads"') was then applied to revert the stack pages
    being used to solve a significant performance regression.

    This patch nearly undoes the effect of all these patches.

    The reason for reverting these is it provides an unusable value in
    field 28. For x86_64, a fork will result in the task->stack_start
    value being updated to the current user top of stack and not the stack
    start address. This unpredictability of the stack_start value makes
    it worthless. That includes the intended use of showing how much stack
    space a thread has.

    Other architectures will get different values. As an example, ia64
    gets 0. The do_fork() and copy_process() functions appear to treat the
    stack_start and stack_size parameters as architecture specific.

    I only partially reverted c44972f1 ("procfs: disable per-task stack usage
    on NOMMU") . If I had completely reverted it, I would have had to change
    mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
    configured. Since I could not test the builds without significant effort,
    I decided to not change mm/Makefile.

    I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
    information for threads on 64-bit") . I left the KSTK_ESP() change in
    place as that seemed worthwhile.

    Signed-off-by: Robin Holt
    Cc: Stefani Seibold
    Cc: KOSAKI Motohiro
    Cc: Michal Simek
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

23 Apr, 2010

1 commit


24 Mar, 2010

1 commit

  • Expose irq_desc->node as /proc/irq/*/node.

    This file provides device hardware locality information for apps
    desiring to include hardware locality in irq mapping decisions.

    Signed-off-by: Dimitri Sivanich
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Thomas Gleixner

    Dimitri Sivanich
     

15 Mar, 2010

1 commit


08 Mar, 2010

1 commit


07 Mar, 2010

3 commits

  • Add documentation for /proc/pagetypeinfo.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • A frequent questions from users about memory management is what numbers of
    swap ents are user for processes. And this information will give some
    hints to oom-killer.

    Besides we can count the number of swapents per a process by scanning
    /proc//smaps, this is very slow and not good for usual process
    information handler which works like 'ps' or 'top'. (ps or top is now
    enough slow..)

    This patch adds a counter of swapents to mm_counter and update is at each
    swap events. Information is exported via /proc//status file as

    [kamezawa@bluextal memory]$ cat /proc/self/status
    Name: cat
    State: R (running)
    Tgid: 2910
    Pid: 2910
    PPid: 2823
    TracerPid: 0
    Uid: 500 500 500 500
    Gid: 500 500 500 500
    FDSize: 256
    Groups: 500
    VmPeak: 82696 kB
    VmSize: 82696 kB
    VmLck: 0 kB
    VmHWM: 432 kB
    VmRSS: 432 kB
    VmData: 172 kB
    VmStk: 84 kB
    VmExe: 48 kB
    VmLib: 1568 kB
    VmPTE: 40 kB
    VmSwap: 0 kB
    Reviewed-by: Minchan Kim
    Reviewed-by: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Considering the nature of per mm stats, it's the shared object among
    threads and can be a cache-miss point in the page fault path.

    This patch adds per-thread cache for mm_counter. RSS value will be
    counted into a struct in task_struct and synchronized with mm's one at
    events.

    Now, in this patch, the event is the number of calls to handle_mm_fault.
    Per-thread value is added to mm at each 64 calls.

    rough estimation with small benchmark on parallel thread (2threads) shows
    [before]
    4.5 cache-miss/faults
    [after]
    4.0 cache-miss/faults
    Anyway, the most contended object is mmap_sem if the number of threads grows.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

18 Feb, 2010

1 commit


12 Jan, 2010

1 commit

  • Commit d899bf7b (procfs: provide stack information for threads) introduced
    to show stack information in /proc/{pid}/status. But it cause large
    performance regression. Unfortunately /proc/{pid}/status is used ps
    command too and ps is one of most important component. Because both to
    take mmap_sem and page table walk are heavily operation.

    If many process run, the ps performance is,

    [before d899bf7b]

    % perf stat ps >/dev/null

    Performance counter stats for 'ps':

    4090.435806 task-clock-msecs # 0.032 CPUs
    229 context-switches # 0.000 M/sec
    0 CPU-migrations # 0.000 M/sec
    234 page-faults # 0.000 M/sec
    8587565207 cycles # 2099.425 M/sec
    9866662403 instructions # 1.149 IPC
    3789415411 cache-references # 926.409 M/sec
    30419509 cache-misses # 7.437 M/sec

    128.859521955 seconds time elapsed

    [after d899bf7b]

    % perf stat ps > /dev/null

    Performance counter stats for 'ps':

    4305.081146 task-clock-msecs # 0.028 CPUs
    480 context-switches # 0.000 M/sec
    2 CPU-migrations # 0.000 M/sec
    237 page-faults # 0.000 M/sec
    9021211334 cycles # 2095.480 M/sec
    10605887536 instructions # 1.176 IPC
    3612650999 cache-references # 839.160 M/sec
    23917502 cache-misses # 5.556 M/sec

    152.277819582 seconds time elapsed

    Thus, this patch revert it. Fortunately /proc/{pid}/task/{tid}/smaps
    provide almost same information. we can use it.

    Commit d899bf7b introduced two features:

    1) Add the annotattion of [thread stack: xxxx] mark to
    /proc/{pid}/task/{tid}/maps.
    2) Add StackUsage field to /proc/{pid}/status.

    I only revert (2), because I haven't seen (1) cause regression.

    Signed-off-by: KOSAKI Motohiro
    Cc: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Cc: Andrew Morton
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

16 Dec, 2009

1 commit

  • Setting a thread's comm to be something unique is a very useful ability
    and is helpful for debugging complicated threaded applications. However
    currently the only way to set a thread name is for the thread to name
    itself via the PR_SET_NAME prctl.

    However, there may be situations where it would be advantageous for a
    thread dispatcher to be naming the threads its managing, rather then
    having the threads self-describe themselves. This sort of behavior is
    available on other systems via the pthread_setname_np() interface.

    This patch exports a task's comm via proc/pid/comm and
    proc/pid/task/tid/comm interfaces, and allows thread siblings to write to
    these values.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: John Stultz
    Cc: Andi Kleen
    Cc: Arjan van de Ven
    Cc: Mike Fulton
    Cc: Sean Foley
    Cc: Darren Hart
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
     

10 Dec, 2009

1 commit

  • the description in Documentation/filesystems/proc.txt of the
    procs_running entry in /proc/stat is confusing (according to that
    description, it looks as if procs_running could only be a number
    between 0 and the number of CPUs).

    Changed it to a more accurate description in the patch attached.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Luis Garces-Erice
     

26 Oct, 2009

1 commit

  • CPU time of a guest is always accounted in 'user' time
    without concern for the nice value of its counterpart
    process although the guest is scheduled under the nice
    value.

    This patch fixes the defect and accounts cpu time of
    a niced guest in 'nice' time as same as a niced process.

    And also the patch adds 'guest_nice' to cpuacct. The
    value provides niced guest cpu time which is like 'nice'
    to 'user'.

    The original discussions can be found here:

    http://www.mail-archive.com/kvm@vger.kernel.org/msg23982.html
    http://www.mail-archive.com/kvm@vger.kernel.org/msg23860.html

    Signed-off-by: Ryota Ozaki
    Acked-by: Avi Kivity
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ryota Ozaki
     

30 Sep, 2009

1 commit

  • The /proc/fs/ext4//mb_history was maintained manually, and had a
    number of problems: it required a largish amount of memory to be
    allocated for each ext4 filesystem, and the s_mb_history_lock
    introduced a CPU contention problem.

    By ripping out the mb_history code and replacing it with ftrace
    tracepoints, and we get more functionality: timestamps, event
    filtering, the ability to correlate mballoc history with other ext4
    tracepoints, etc.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

23 Sep, 2009

1 commit

  • A patch to give a better overview of the userland application stack usage,
    especially for embedded linux.

    Currently you are only able to dump the main process/thread stack usage
    which is showed in /proc/pid/status by the "VmStk" Value. But you get no
    information about the consumed stack memory of the the threads.

    There is an enhancement in the /proc//{task/*,}/*maps and which marks
    the vm mapping where the thread stack pointer reside with "[thread stack
    xxxxxxxx]". xxxxxxxx is the maximum size of stack. This is a value
    information, because libpthread doesn't set the start of the stack to the
    top of the mapped area, depending of the pthread usage.

    A sample output of /proc//task//maps looks like:

    08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
    08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
    0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
    a7d12000-a7d13000 ---p 00000000 00:00 0
    a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
    a7f13000-a7f14000 ---p 00000000 00:00 0
    a7f14000-a7f36000 rw-p 00000000 00:00 0
    a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
    a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
    a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
    a806c000-a806f000 rw-p 00000000 00:00 0
    a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
    a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
    a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
    a8085000-a8088000 rw-p 00000000 00:00 0
    a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
    a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
    a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
    afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
    ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

    Also there is a new entry "stack usage" in /proc//{task/*,}/status
    which will you give the current stack usage in kb.

    A sample output of /proc/self/status looks like:

    Name: cat
    State: R (running)
    Tgid: 507
    Pid: 507
    .
    .
    .
    CapBnd: fffffffffffffeff
    voluntary_ctxt_switches: 0
    nonvoluntary_ctxt_switches: 0
    Stack usage: 12 kB

    I also fixed stack base address in /proc//{task/*,}/stat to the base
    address of the associated thread stack and not the one of the main
    process. This makes more sense.

    [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
    Signed-off-by: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     

22 Sep, 2009

3 commits

  • oom-killer kills a process, not task. Then oom_score should be calculated
    as per-process too. it makes consistency more and makes speed up
    select_bad_process().

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The patch makes the clear_refs more versatile in adding the option to
    select anonymous pages or file backed pages for clearing. This addition
    has a measurable impact on user space application performance as it
    decreases the number of pagewalks in scenarios where one is only
    interested in a specific type of page (anonymous or file mapped).

    The patch adds anonymous and file backed filters to the clear_refs interface.

    echo 1 > /proc/PID/clear_refs resets the bits on all pages
    echo 2 > /proc/PID/clear_refs resets the bits on anonymous pages only
    echo 3 > /proc/PID/clear_refs resets the bits on file backed pages only

    Any other value is ignored

    Signed-off-by: Moussa A. Ba
    Signed-off-by: Jared E. Hulbert
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Moussa A. Ba
     
  • We added a new column in cpuX lines of /proc/stat, to show the amount of
    time spent by a cpu servicing a guest, without updating
    Documentation/filesystems/proc.txt

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

19 Aug, 2009

1 commit

  • The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to
    the mm_struct. It was a very good first step for sanitize OOM.

    However Paul Menage reported the commit makes regression to his job
    scheduler. Current OOM logic can kill OOM_DISABLED process.

    Why? His program has the code of similar to the following.

    ...
    set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
    ...
    if (vfork() == 0) {
    set_oom_adj(0); /* Invoked child can be killed */
    execve("foo-bar-cmd");
    }
    ....

    vfork() parent and child are shared the same mm_struct. then above
    set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also
    change oom_adj for vfork() parent. Then, vfork() parent (job scheduler)
    lost OOM immune and it was killed.

    Actually, fork-setting-exec idiom is very frequently used in userland program.
    We must not break this assumption.

    Then, this patch revert commit 2ff05b2b and related commit.

    Reverted commit list
    ---------------------
    - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct)
    - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
    - commit 8123681022 (oom: only oom kill exiting tasks with attached memory)
    - commit 933b787b57 (mm: copy over oom_adj value at fork time)

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

19 Jun, 2009

2 commits

  • An update for the "Process-Specific Subdirectories" section to reflect the
    changes till kernel 2.6.30.

    Signed-off-by: Stefani Seibold
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     
  • Export statistics for softirq in /proc/softirqs and /proc/stat.

    1. /proc/softirqs
    Implement /proc/softirqs which shows the number of softirq
    for each CPU like /proc/interrupts.

    2. /proc/stat
    Add the "softirq" line to /proc/stat.
    This line shows the number of softirq for all cpu.
    The first column is the total of all softirqs and
    each subsequent column is the total for particular softirq.

    [kosaki.motohiro@jp.fujitsu.com: remove redundant for_each_possible_cpu() loop]
    Signed-off-by: Keika Kobayashi
    Reviewed-by: Hiroshi Shimamoto
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: Eric Dumazet
    Cc: Alexey Dobriyan
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Keika Kobayashi
     

17 Jun, 2009

1 commit

  • The per-task oom_adj value is a characteristic of its mm more than the
    task itself since it's not possible to oom kill any thread that shares the
    mm. If a task were to be killed while attached to an mm that could not be
    freed because another thread were set to OOM_DISABLE, it would have
    needlessly been terminated since there is no potential for future memory
    freeing.

    This patch moves oomkilladj (now more appropriately named oom_adj) from
    struct task_struct to struct mm_struct. This requires task_lock() on a
    task to check its oom_adj value to protect against exec, but it's already
    necessary to take the lock when dereferencing the mm to find the total VM
    size for the badness heuristic.

    This fixes a livelock if the oom killer chooses a task and another thread
    sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs
    because oom_kill_task() repeatedly returns 1 and refuses to kill the
    chosen task while select_bad_process() will repeatedly choose the same
    task during the next retry.

    Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
    oom_kill_task() to check for threads sharing the same memory will be
    removed in the next patch in this series where it will no longer be
    necessary.

    Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
    these threads are immune from oom killing already. They simply report an
    oom_adj value of OOM_DISABLE.

    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

13 Jun, 2009

1 commit


03 Apr, 2009

1 commit

  • Now /proc/sys is described in many places and much information is
    redundant. This patch updates the proc.txt and move the /proc/sys
    desciption out to the files in Documentation/sysctls.

    Details are:

    merge
    - 2.1 /proc/sys/fs - File system data
    - 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
    - 2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface
    with Documentation/sysctls/fs.txt.

    remove
    - 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
    since it's not better then the Documentation/binfmt_misc.txt.

    merge
    - 2.3 /proc/sys/kernel - general kernel parameters
    with Documentation/sysctls/kernel.txt

    remove
    - 2.5 /proc/sys/dev - Device specific parameters
    since it's obsolete the sysfs is used now.

    remove
    - 2.6 /proc/sys/sunrpc - Remote procedure calls
    since it's not better then the Documentation/sysctls/sunrpc.txt

    move
    - 2.7 /proc/sys/net - Networking stuff
    - 2.9 Appletalk
    - 2.10 IPX
    to newly created Documentation/sysctls/net.txt.

    remove
    - 2.8 /proc/sys/net/ipv4 - IPV4 settings
    since it's not better then the Documentation/networking/ip-sysctl.txt.

    add
    - Chapter 3 Per-Process Parameters
    to descibe /proc//xxx parameters.

    Signed-off-by: Shen Feng
    Cc: Randy Dunlap
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shen Feng
     

31 Mar, 2009

1 commit


19 Mar, 2009

1 commit


30 Jan, 2009

1 commit


16 Jan, 2009

1 commit

  • Update Documentation/sysctl/vm.txt and Documentation/filesystems/proc.txt.
    More specifically, the section on /proc/sys/vm in
    Documentation/filesystems/proc.txt was removed and a link to
    Documentation/sysctl/vm.txt added.

    Most of the verbiage from proc.txt was simply moved in vm.txt, with new
    addtional text for "swappiness" and "stat_interval".

    Signed-off-by: Peter W Morreale
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter W Morreale
     

08 Jan, 2009

1 commit


07 Jan, 2009

1 commit

  • This change introduces two new sysctls to /proc/sys/vm:
    dirty_background_bytes and dirty_bytes.

    dirty_background_bytes is the counterpart to dirty_background_ratio and
    dirty_bytes is the counterpart to dirty_ratio.

    With growing memory capacities of individual machines, it's no longer
    sufficient to specify dirty thresholds as a percentage of the amount of
    dirtyable memory over the entire system.

    dirty_background_bytes and dirty_bytes specify quantities of memory, in
    bytes, that represent the dirty limits for the entire system. If either
    of these values is set, its value represents the amount of dirty memory
    that is needed to commence either background or direct writeback.

    When a `bytes' or `ratio' file is written, its counterpart becomes a
    function of the written value. For example, if dirty_bytes is written to
    be 8096, 8K of memory is required to commence direct writeback.
    dirty_ratio is then functionally equivalent to 8K / the amount of
    dirtyable memory:

    dirtyable_memory = free pages + mapped pages + file cache

    dirty_background_bytes = dirty_background_ratio * dirtyable_memory
    -or-
    dirty_background_ratio = dirty_background_bytes / dirtyable_memory

    AND

    dirty_bytes = dirty_ratio * dirtyable_memory
    -or-
    dirty_ratio = dirty_bytes / dirtyable_memory

    Only one of dirty_background_bytes and dirty_background_ratio may be
    specified at a time, and only one of dirty_bytes and dirty_ratio may be
    specified. When one sysctl is written, the other appears as 0 when read.

    The `bytes' files operate on a page size granularity since dirty limits
    are compared with ZVC values, which are in page units.

    Prior to this change, the minimum dirty_ratio was 5 as implemented by
    get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
    written value between 0 and 100. This restriction is maintained, but
    dirty_bytes has a lower limit of only one page.

    Also prior to this change, the dirty_background_ratio could not equal or
    exceed dirty_ratio. This restriction is maintained in addition to
    restricting dirty_background_bytes. If either background threshold equals
    or exceeds that of the dirty threshold, it is implicitly set to half the
    dirty threshold.

    Acked-by: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Cc: Andrea Righi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

05 Jan, 2009

1 commit

  • /proc/*/stack adds the ability to query a task's stack trace. It is more
    useful than /proc/*/wchan as it provides full stack trace instead of single
    depth. Example output:

    $ cat /proc/self/stack
    [] save_stack_trace_tsk+0x17/0x35
    [] proc_pid_stack+0x4a/0x76
    [] proc_single_show+0x4a/0x5e
    [] seq_read+0xf3/0x29f
    [] vfs_read+0x6d/0x91
    [] sys_read+0x3b/0x60
    [] syscall_call+0x7/0xb
    [] 0xffffffff

    [add save_stack_trace_tsk() on mips, ACK Ralf --adobriyan]
    Signed-off-by: Ken Chen
    Signed-off-by: Ingo Molnar
    Signed-off-by: Alexey Dobriyan

    Ken Chen
     

23 Dec, 2008

1 commit

  • …86/debug', 'x86/defconfig', 'x86/detect-hyper', 'x86/doc', 'x86/dumpstack', 'x86/early-printk', 'x86/fpu', 'x86/idle', 'x86/io', 'x86/memory-corruption-check', 'x86/microcode', 'x86/mm', 'x86/mtrr', 'x86/nmi-watchdog', 'x86/pat2', 'x86/pci-ioapic-boot-irq-quirks', 'x86/ptrace', 'x86/quirks', 'x86/reboot', 'x86/setup-memory', 'x86/signal', 'x86/sparse-fixes', 'x86/time', 'x86/uv' and 'x86/xen' into x86/core

    Ingo Molnar
     

02 Dec, 2008

1 commit

  • It has been thought that the per-user file descriptors limit would also
    limit the resources that a normal user can request via the epoll
    interface. Vegard Nossum reported a very simple program (a modified
    version attached) that can make a normal user to request a pretty large
    amount of kernel memory, well within the its maximum number of fds. To
    solve such problem, default limits are now imposed, and /proc based
    configuration has been introduced. A new directory has been created,
    named /proc/sys/fs/epoll/ and inside there, there are two configuration
    points:

    max_user_instances = Maximum number of devices - per user

    max_user_watches = Maximum number of "watched" fds - per user

    The current default for "max_user_watches" limits the memory used by epoll
    to store "watches", to 1/32 of the amount of the low RAM. As example, a
    256MB 32bit machine, will have "max_user_watches" set to roughly 90000.
    That should be enough to not break existing heavy epoll users. The
    default value for "max_user_instances" is set to 128, that should be
    enough too.

    This also changes the userspace, because a new error code can now come out
    from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already
    listed, so that should be ok.

    [akpm@linux-foundation.org: use get_current_user()]
    Signed-off-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc:
    Cc: Cyrill Gorcunov
    Reported-by: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

31 Oct, 2008

1 commit


20 Oct, 2008

1 commit

  • The current documentation of dirty_ratio and dirty_background_ratio is a
    bit misleading.

    In the documentation we say that they are "a percentage of total system
    memory", but the current page writeback policy, intead, is to apply the
    percentages to the dirtyable memory, that means free pages + reclaimable
    pages.

    Better to be more explicit to clarify this concept.

    Signed-off-by: Andrea Righi
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi