15 Jun, 2011

1 commit

  • Commit a26ac2455ffcf3(rcu: move TREE_RCU from softirq to kthread)
    introduced performance regression. In an AIM7 test, this commit degraded
    performance by about 40%.

    The commit runs rcu callbacks in a kthread instead of softirq. We observed
    high rate of context switch which is caused by this. Out test system has
    64 CPUs and HZ is 1000, so we saw more than 64k context switch per second
    which is caused by RCU's per-CPU kthread. A trace showed that most of
    the time the RCU per-CPU kthread doesn't actually handle any callbacks,
    but instead just does a very small amount of work handling grace periods.
    This means that RCU's per-CPU kthreads are making the scheduler do quite
    a bit of work in order to allow a very small amount of RCU-related
    processing to be done.

    Alex Shi's analysis determined that this slowdown is due to lock
    contention within the scheduler. Unfortunately, as Peter Zijlstra points
    out, the scheduler's real-time semantics require global action, which
    means that this contention is inherent in real-time scheduling. (Yes,
    perhaps someone will come up with a workaround -- otherwise, -rt is not
    going to do well on large SMP systems -- but this patch will work around
    this issue in the meantime. And "the meantime" might well be forever.)

    This patch therefore re-introduces softirq processing to RCU, but only
    for core RCU work. RCU callbacks are still executed in kthread context,
    so that only a small amount of RCU work runs in softirq context in the
    common case. This should minimize ksoftirqd execution, allowing us to
    skip boosting of ksoftirqd for CONFIG_RCU_BOOST=y kernels.

    Signed-off-by: Shaohua Li
    Tested-by: "Alex,Shi"
    Signed-off-by: Paul E. McKenney

    Shaohua Li
     

25 May, 2011

1 commit

  • Manually adjusting the smp_affinity for IRQ's becomes unwieldy when the
    cpu count is large.

    Setting smp affinity to cpus 256 to 263 would be:

    echo 000000ff,00000000,00000000,00000000,00000000,00000000,00000000,00000000 > smp_affinity

    instead of:

    echo 256-263 > smp_affinity_list

    Think about what it looks like for cpus around say, 4088 to 4095.

    We already have many alternate "list" interfaces:

    /sys/devices/system/cpu/cpuX/indexY/shared_cpu_list
    /sys/devices/system/cpu/cpuX/topology/thread_siblings_list
    /sys/devices/system/cpu/cpuX/topology/core_siblings_list
    /sys/devices/system/node/nodeX/cpulist
    /sys/devices/pci***/***/local_cpulist

    Add a companion interface, smp_affinity_list to use cpu lists instead of
    cpu maps. This conforms to other companion interfaces where both a map
    and a list interface exists.

    This required adding a bitmap_parselist_user() function in a manner
    similar to the bitmap_parse_user() function.

    [akpm@linux-foundation.org: make __bitmap_parselist() static]
    Signed-off-by: Mike Travis
    Cc: Thomas Gleixner
    Cc: Jack Steiner
    Cc: Lee Schermerhorn
    Cc: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Travis
     

06 May, 2011

1 commit

  • If RCU priority boosting is to be meaningful, callback invocation must
    be boosted in addition to preempted RCU readers. Otherwise, in presence
    of CPU real-time threads, the grace period ends, but the callbacks don't
    get invoked. If the callbacks don't get invoked, the associated memory
    doesn't get freed, so the system is still subject to OOM.

    But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
    moves the callback invocations to a kthread, which can be boosted easily.

    Also add comments and properly synchronized all accesses to
    rcu_cpu_kthread_task, as suggested by Lai Jiangshan.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

31 Mar, 2011

1 commit


14 Jan, 2011

2 commits

  • We'd like to be able to oom_score_adj a process up/down as it
    enters/leaves the foreground. Currently, it is not possible to oom_adj
    down without CAP_SYS_RESOURCE. This patch allows a task to decrease its
    oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
    or its inherited value at fork. Assuming the thread that has forked it
    has oom_score_adj of 0, each process could decrease it back from 0 upon
    activation unless a CAP_SYS_RESOURCE thread elevated it to something
    higher.

    Alternative considered:

    * a setuid binary
    * a daemon with CAP_SYS_RESOURCE

    Since you don't wan't all processes to be able to reduce their oom_adj, a
    setuid or daemon implementation would be complex. The alternatives also
    have much higher overhead.

    This patch updated from original patch based on feedback from David
    Rientjes.

    Signed-off-by: Mandeep Singh Baines
    Acked-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
     
  • Currently there is no way to find whether a process has locked its pages
    in memory or not. And which of the memory regions are locked in memory.

    Add a new field "Locked" to export this information via the smaps file.

    Signed-off-by: Nikanth Karthikesan
    Acked-by: Balbir Singh
    Acked-by: Wu Fengguang
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikanth Karthikesan
     

17 Nov, 2010

1 commit

  • It allows users to see what consoles are currently known to the system
    and with what flags.

    It is based on Werner's patch, the part about traversing fds was
    removed, the code was moved to kernel/printk.c, where consoles are
    handled and it makes more sense to me.

    Signed-off-by: Jiri Slaby [cleanups]
    Signed-off-by: "Dr. Werner Fink"
    Cc: Al Viro
    Cc: Greg Kroah-Hartman
    Signed-off-by: Greg Kroah-Hartman

    Jiri Slaby
     

28 Oct, 2010

2 commits

  • Document /proc/pid/pagemap in Documentation/filesystems/proc.txt

    Signed-off-by: Nikanth Karthikesan
    Cc: Richard Guenther
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikanth Karthikesan
     
  • Export the number of anonymous pages in a mapping via smaps.

    Even the private pages in a mapping backed by a file, would be marked as
    anonymous, when they are modified. Export this information to user-space via
    smaps.

    Exporting this count will help gdb to make a better decision on which
    areas need to be dumped in its coredump; and should be useful to others
    studying the memory usage of a process.

    Signed-off-by: Nikanth Karthikesan
    Acked-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikanth Karthikesan
     

27 Oct, 2010

1 commit


23 Oct, 2010

2 commits

  • This reverts commit f4a3e0bceb57466c31757f25e4e0ed108d1299ec. Jiri
    Sladby points out that the tty structure we're using may already be
    gone, and Al Viro doesn't hold back in complaining about the random
    loading of 'filp->private_data' which doesn't have to be a pointer at
    all, nor does checking the magic field for TTY_MAGIC prove anything.

    Belated review by Al:

    "a) global variable depending on stdin of the last opener? Affecting
    output of read(2)? Really?

    b) iterator is broken; list should be locked in ->start(), unlocked in
    ->stop() and *NOT* unlocked/relocked in ->next()

    c) ->show() ought to do nothing in case of ->device == NULL, instead
    of skipping those in ->next()/->start()

    d) regardless of the merits of the bright idea about asterisk at that
    line in output *and* regardless of (a), the implementation is not
    only atrociously ugly, it's actually very likely to be a roothole.
    Verifying that Cthulhu knows what number happens to be address of a
    tty_struct by blindly dereferencing memory at that address...
    Ouch.

    Please revert that crap."

    And Christoph pipes in and NAK's the approach of walking fd tables etc
    too. So it's pretty unanimous.

    Noticed-by: Jri Slaby
    Requested-by: Al Viro
    Cc: Greg Kroah-Hartman
    Cc: Werner Fink
    Cc: Alan Cox
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Add a new file /proc/tty/consoles to be able to determine the registered
    system console lines. If the reading process holds /dev/console open at
    the regular standard input stream the active device will be marked by an
    asterisk. Show possible operations and also decode the used flags of
    the listed console lines.

    Signed-off-by: Werner Fink
    Cc: Alan Cox
    Signed-off-by: Greg Kroah-Hartman

    Dr. Werner Fink
     

10 Aug, 2010

2 commits

  • /proc/pid/oom_adj is now deprecated so that that it may eventually be
    removed. The target date for removal is August 2012.

    A warning will be printed to the kernel log if a task attempts to use this
    interface. Future warning will be suppressed until the kernel is rebooted
    to prevent spamming the kernel log.

    Signed-off-by: David Rientjes
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This a complete rewrite of the oom killer's badness() heuristic which is
    used to determine which task to kill in oom conditions. The goal is to
    make it as simple and predictable as possible so the results are better
    understood and we end up killing the task which will lead to the most
    memory freeing while still respecting the fine-tuning from userspace.

    Instead of basing the heuristic on mm->total_vm for each task, the task's
    rss and swap space is used instead. This is a better indication of the
    amount of memory that will be freeable if the oom killed task is chosen
    and subsequently exits. This helps specifically in cases where KDE or
    GNOME is chosen for oom kill on desktop systems instead of a memory
    hogging task.

    The baseline for the heuristic is a proportion of memory that each task is
    currently using in memory plus swap compared to the amount of "allowable"
    memory. "Allowable," in this sense, means the system-wide resources for
    unconstrained oom conditions, the set of mempolicy nodes, the mems
    attached to current's cpuset, or a memory controller's limit. The
    proportion is given on a scale of 0 (never kill) to 1000 (always kill),
    roughly meaning that if a task has a badness() score of 500 that the task
    consumes approximately 50% of allowable memory resident in RAM or in swap
    space.

    The proportion is always relative to the amount of "allowable" memory and
    not the total amount of RAM systemwide so that mempolicies and cpusets may
    operate in isolation; they shall not need to know the true size of the
    machine on which they are running if they are bound to a specific set of
    nodes or mems, respectively.

    Root tasks are given 3% extra memory just like __vm_enough_memory()
    provides in LSMs. In the event of two tasks consuming similar amounts of
    memory, it is generally better to save root's task.

    Because of the change in the badness() heuristic's baseline, it is also
    necessary to introduce a new user interface to tune it. It's not possible
    to redefine the meaning of /proc/pid/oom_adj with a new scale since the
    ABI cannot be changed for backward compatability. Instead, a new tunable,
    /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
    be used to polarize the heuristic such that certain tasks are never
    considered for oom kill while others may always be considered. The value
    is added directly into the badness() score so a value of -500, for
    example, means to discount 50% of its memory consumption in comparison to
    other tasks either on the system, bound to the mempolicy, in the cpuset,
    or sharing the same memory controller.

    /proc/pid/oom_adj is changed so that its meaning is rescaled into the
    units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
    these per-task tunables will rescale the value of the other to an
    equivalent meaning. Although /proc/pid/oom_adj was originally defined as
    a bitshift on the badness score, it now shares the same linear growth as
    /proc/pid/oom_score_adj but with different granularity. This is required
    so the ABI is not broken with userspace applications and allows oom_adj to
    be deprecated for future removal.

    Signed-off-by: David Rientjes
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

04 Aug, 2010

1 commit

  • Below you will find an updated version from the original series bunching all patches into one big patch
    updating broken web addresses that are located in Documentation/*
    Some of the addresses date as far far back as 1995 etc... so searching became a bit difficult,
    the best way to deal with these is to use web.archive.org to locate these addresses that are outdated.
    Now there are also some addresses pointing to .spec files some are located, but some(after searching
    on the companies site)where still no where to be found. In this case I just changed the address
    to the company site this way the users can contact the company and they can locate them for the users.

    Signed-off-by: Justin P. Mattock
    Signed-off-by: Thomas Weber
    Signed-off-by: Mike Frysinger
    Cc: Paulo Marques
    Cc: Randy Dunlap
    Cc: Michael Neuling
    Signed-off-by: Jiri Kosina

    Justin P. Mattock
     

21 May, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
    vlynq: make whole Kconfig-menu dependant on architecture
    add descriptive comment for TIF_MEMDIE task flag declaration.
    EEPROM: max6875: Header file cleanup
    EEPROM: 93cx6: Header file cleanup
    EEPROM: Header file cleanup
    agp: use NULL instead of 0 when pointer is needed
    rtc-v3020: make bitfield unsigned
    PCI: make bitfield unsigned
    jbd2: use NULL instead of 0 when pointer is needed
    cciss: fix shadows sparse warning
    doc: inode uses a mutex instead of a semaphore.
    uml: i386: Avoid redefinition of NR_syscalls
    fix "seperate" typos in comments
    cocbalt_lcdfb: correct sections
    doc: Change urls for sparse
    Powerpc: wii: Fix typo in comment
    i2o: cleanup some exit paths
    Documentation/: it's -> its where appropriate
    UML: Fix compiler warning due to missing task_struct declaration
    UML: add kernel.h include to signal.c
    ...

    Linus Torvalds
     

20 May, 2010

1 commit


12 May, 2010

1 commit

  • Originally, commit d899bf7b ("procfs: provide stack information for
    threads") attempted to introduce a new feature for showing where the
    threadstack was located and how many pages are being utilized by the
    stack.

    Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
    applied to fix the NO_MMU case.

    Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
    64-bit") was applied to fix a bug in ia32 executables being loaded.

    Commit 9ebd4eba7 ("procfs: fix /proc//stat stack pointer for kernel
    threads") was applied to fix a bug which had kernel threads printing a
    userland stack address.

    Commit 1306d603f ('proc: partially revert "procfs: provide stack
    information for threads"') was then applied to revert the stack pages
    being used to solve a significant performance regression.

    This patch nearly undoes the effect of all these patches.

    The reason for reverting these is it provides an unusable value in
    field 28. For x86_64, a fork will result in the task->stack_start
    value being updated to the current user top of stack and not the stack
    start address. This unpredictability of the stack_start value makes
    it worthless. That includes the intended use of showing how much stack
    space a thread has.

    Other architectures will get different values. As an example, ia64
    gets 0. The do_fork() and copy_process() functions appear to treat the
    stack_start and stack_size parameters as architecture specific.

    I only partially reverted c44972f1 ("procfs: disable per-task stack usage
    on NOMMU") . If I had completely reverted it, I would have had to change
    mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
    configured. Since I could not test the builds without significant effort,
    I decided to not change mm/Makefile.

    I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
    information for threads on 64-bit") . I left the KSTK_ESP() change in
    place as that seemed worthwhile.

    Signed-off-by: Robin Holt
    Cc: Stefani Seibold
    Cc: KOSAKI Motohiro
    Cc: Michal Simek
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

23 Apr, 2010

1 commit


24 Mar, 2010

1 commit

  • Expose irq_desc->node as /proc/irq/*/node.

    This file provides device hardware locality information for apps
    desiring to include hardware locality in irq mapping decisions.

    Signed-off-by: Dimitri Sivanich
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Thomas Gleixner

    Dimitri Sivanich
     

15 Mar, 2010

1 commit


08 Mar, 2010

1 commit


07 Mar, 2010

3 commits

  • Add documentation for /proc/pagetypeinfo.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • A frequent questions from users about memory management is what numbers of
    swap ents are user for processes. And this information will give some
    hints to oom-killer.

    Besides we can count the number of swapents per a process by scanning
    /proc//smaps, this is very slow and not good for usual process
    information handler which works like 'ps' or 'top'. (ps or top is now
    enough slow..)

    This patch adds a counter of swapents to mm_counter and update is at each
    swap events. Information is exported via /proc//status file as

    [kamezawa@bluextal memory]$ cat /proc/self/status
    Name: cat
    State: R (running)
    Tgid: 2910
    Pid: 2910
    PPid: 2823
    TracerPid: 0
    Uid: 500 500 500 500
    Gid: 500 500 500 500
    FDSize: 256
    Groups: 500
    VmPeak: 82696 kB
    VmSize: 82696 kB
    VmLck: 0 kB
    VmHWM: 432 kB
    VmRSS: 432 kB
    VmData: 172 kB
    VmStk: 84 kB
    VmExe: 48 kB
    VmLib: 1568 kB
    VmPTE: 40 kB
    VmSwap: 0 kB
    Reviewed-by: Minchan Kim
    Reviewed-by: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Considering the nature of per mm stats, it's the shared object among
    threads and can be a cache-miss point in the page fault path.

    This patch adds per-thread cache for mm_counter. RSS value will be
    counted into a struct in task_struct and synchronized with mm's one at
    events.

    Now, in this patch, the event is the number of calls to handle_mm_fault.
    Per-thread value is added to mm at each 64 calls.

    rough estimation with small benchmark on parallel thread (2threads) shows
    [before]
    4.5 cache-miss/faults
    [after]
    4.0 cache-miss/faults
    Anyway, the most contended object is mmap_sem if the number of threads grows.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

18 Feb, 2010

1 commit


12 Jan, 2010

1 commit

  • Commit d899bf7b (procfs: provide stack information for threads) introduced
    to show stack information in /proc/{pid}/status. But it cause large
    performance regression. Unfortunately /proc/{pid}/status is used ps
    command too and ps is one of most important component. Because both to
    take mmap_sem and page table walk are heavily operation.

    If many process run, the ps performance is,

    [before d899bf7b]

    % perf stat ps >/dev/null

    Performance counter stats for 'ps':

    4090.435806 task-clock-msecs # 0.032 CPUs
    229 context-switches # 0.000 M/sec
    0 CPU-migrations # 0.000 M/sec
    234 page-faults # 0.000 M/sec
    8587565207 cycles # 2099.425 M/sec
    9866662403 instructions # 1.149 IPC
    3789415411 cache-references # 926.409 M/sec
    30419509 cache-misses # 7.437 M/sec

    128.859521955 seconds time elapsed

    [after d899bf7b]

    % perf stat ps > /dev/null

    Performance counter stats for 'ps':

    4305.081146 task-clock-msecs # 0.028 CPUs
    480 context-switches # 0.000 M/sec
    2 CPU-migrations # 0.000 M/sec
    237 page-faults # 0.000 M/sec
    9021211334 cycles # 2095.480 M/sec
    10605887536 instructions # 1.176 IPC
    3612650999 cache-references # 839.160 M/sec
    23917502 cache-misses # 5.556 M/sec

    152.277819582 seconds time elapsed

    Thus, this patch revert it. Fortunately /proc/{pid}/task/{tid}/smaps
    provide almost same information. we can use it.

    Commit d899bf7b introduced two features:

    1) Add the annotattion of [thread stack: xxxx] mark to
    /proc/{pid}/task/{tid}/maps.
    2) Add StackUsage field to /proc/{pid}/status.

    I only revert (2), because I haven't seen (1) cause regression.

    Signed-off-by: KOSAKI Motohiro
    Cc: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Cc: Andrew Morton
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

16 Dec, 2009

1 commit

  • Setting a thread's comm to be something unique is a very useful ability
    and is helpful for debugging complicated threaded applications. However
    currently the only way to set a thread name is for the thread to name
    itself via the PR_SET_NAME prctl.

    However, there may be situations where it would be advantageous for a
    thread dispatcher to be naming the threads its managing, rather then
    having the threads self-describe themselves. This sort of behavior is
    available on other systems via the pthread_setname_np() interface.

    This patch exports a task's comm via proc/pid/comm and
    proc/pid/task/tid/comm interfaces, and allows thread siblings to write to
    these values.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: John Stultz
    Cc: Andi Kleen
    Cc: Arjan van de Ven
    Cc: Mike Fulton
    Cc: Sean Foley
    Cc: Darren Hart
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
     

10 Dec, 2009

1 commit

  • the description in Documentation/filesystems/proc.txt of the
    procs_running entry in /proc/stat is confusing (according to that
    description, it looks as if procs_running could only be a number
    between 0 and the number of CPUs).

    Changed it to a more accurate description in the patch attached.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Luis Garces-Erice
     

26 Oct, 2009

1 commit

  • CPU time of a guest is always accounted in 'user' time
    without concern for the nice value of its counterpart
    process although the guest is scheduled under the nice
    value.

    This patch fixes the defect and accounts cpu time of
    a niced guest in 'nice' time as same as a niced process.

    And also the patch adds 'guest_nice' to cpuacct. The
    value provides niced guest cpu time which is like 'nice'
    to 'user'.

    The original discussions can be found here:

    http://www.mail-archive.com/kvm@vger.kernel.org/msg23982.html
    http://www.mail-archive.com/kvm@vger.kernel.org/msg23860.html

    Signed-off-by: Ryota Ozaki
    Acked-by: Avi Kivity
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ryota Ozaki
     

30 Sep, 2009

1 commit

  • The /proc/fs/ext4//mb_history was maintained manually, and had a
    number of problems: it required a largish amount of memory to be
    allocated for each ext4 filesystem, and the s_mb_history_lock
    introduced a CPU contention problem.

    By ripping out the mb_history code and replacing it with ftrace
    tracepoints, and we get more functionality: timestamps, event
    filtering, the ability to correlate mballoc history with other ext4
    tracepoints, etc.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

23 Sep, 2009

1 commit

  • A patch to give a better overview of the userland application stack usage,
    especially for embedded linux.

    Currently you are only able to dump the main process/thread stack usage
    which is showed in /proc/pid/status by the "VmStk" Value. But you get no
    information about the consumed stack memory of the the threads.

    There is an enhancement in the /proc//{task/*,}/*maps and which marks
    the vm mapping where the thread stack pointer reside with "[thread stack
    xxxxxxxx]". xxxxxxxx is the maximum size of stack. This is a value
    information, because libpthread doesn't set the start of the stack to the
    top of the mapped area, depending of the pthread usage.

    A sample output of /proc//task//maps looks like:

    08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
    08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
    0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
    a7d12000-a7d13000 ---p 00000000 00:00 0
    a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
    a7f13000-a7f14000 ---p 00000000 00:00 0
    a7f14000-a7f36000 rw-p 00000000 00:00 0
    a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
    a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
    a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
    a806c000-a806f000 rw-p 00000000 00:00 0
    a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
    a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
    a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
    a8085000-a8088000 rw-p 00000000 00:00 0
    a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
    a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
    a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
    afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
    ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

    Also there is a new entry "stack usage" in /proc//{task/*,}/status
    which will you give the current stack usage in kb.

    A sample output of /proc/self/status looks like:

    Name: cat
    State: R (running)
    Tgid: 507
    Pid: 507
    .
    .
    .
    CapBnd: fffffffffffffeff
    voluntary_ctxt_switches: 0
    nonvoluntary_ctxt_switches: 0
    Stack usage: 12 kB

    I also fixed stack base address in /proc//{task/*,}/stat to the base
    address of the associated thread stack and not the one of the main
    process. This makes more sense.

    [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
    Signed-off-by: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     

22 Sep, 2009

3 commits

  • oom-killer kills a process, not task. Then oom_score should be calculated
    as per-process too. it makes consistency more and makes speed up
    select_bad_process().

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The patch makes the clear_refs more versatile in adding the option to
    select anonymous pages or file backed pages for clearing. This addition
    has a measurable impact on user space application performance as it
    decreases the number of pagewalks in scenarios where one is only
    interested in a specific type of page (anonymous or file mapped).

    The patch adds anonymous and file backed filters to the clear_refs interface.

    echo 1 > /proc/PID/clear_refs resets the bits on all pages
    echo 2 > /proc/PID/clear_refs resets the bits on anonymous pages only
    echo 3 > /proc/PID/clear_refs resets the bits on file backed pages only

    Any other value is ignored

    Signed-off-by: Moussa A. Ba
    Signed-off-by: Jared E. Hulbert
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Moussa A. Ba
     
  • We added a new column in cpuX lines of /proc/stat, to show the amount of
    time spent by a cpu servicing a guest, without updating
    Documentation/filesystems/proc.txt

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

19 Aug, 2009

1 commit

  • The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to
    the mm_struct. It was a very good first step for sanitize OOM.

    However Paul Menage reported the commit makes regression to his job
    scheduler. Current OOM logic can kill OOM_DISABLED process.

    Why? His program has the code of similar to the following.

    ...
    set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
    ...
    if (vfork() == 0) {
    set_oom_adj(0); /* Invoked child can be killed */
    execve("foo-bar-cmd");
    }
    ....

    vfork() parent and child are shared the same mm_struct. then above
    set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also
    change oom_adj for vfork() parent. Then, vfork() parent (job scheduler)
    lost OOM immune and it was killed.

    Actually, fork-setting-exec idiom is very frequently used in userland program.
    We must not break this assumption.

    Then, this patch revert commit 2ff05b2b and related commit.

    Reverted commit list
    ---------------------
    - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct)
    - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
    - commit 8123681022 (oom: only oom kill exiting tasks with attached memory)
    - commit 933b787b57 (mm: copy over oom_adj value at fork time)

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

19 Jun, 2009

2 commits

  • An update for the "Process-Specific Subdirectories" section to reflect the
    changes till kernel 2.6.30.

    Signed-off-by: Stefani Seibold
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     
  • Export statistics for softirq in /proc/softirqs and /proc/stat.

    1. /proc/softirqs
    Implement /proc/softirqs which shows the number of softirq
    for each CPU like /proc/interrupts.

    2. /proc/stat
    Add the "softirq" line to /proc/stat.
    This line shows the number of softirq for all cpu.
    The first column is the total of all softirqs and
    each subsequent column is the total for particular softirq.

    [kosaki.motohiro@jp.fujitsu.com: remove redundant for_each_possible_cpu() loop]
    Signed-off-by: Keika Kobayashi
    Reviewed-by: Hiroshi Shimamoto
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: Eric Dumazet
    Cc: Alexey Dobriyan
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Keika Kobayashi
     

17 Jun, 2009

1 commit

  • The per-task oom_adj value is a characteristic of its mm more than the
    task itself since it's not possible to oom kill any thread that shares the
    mm. If a task were to be killed while attached to an mm that could not be
    freed because another thread were set to OOM_DISABLE, it would have
    needlessly been terminated since there is no potential for future memory
    freeing.

    This patch moves oomkilladj (now more appropriately named oom_adj) from
    struct task_struct to struct mm_struct. This requires task_lock() on a
    task to check its oom_adj value to protect against exec, but it's already
    necessary to take the lock when dereferencing the mm to find the total VM
    size for the badness heuristic.

    This fixes a livelock if the oom killer chooses a task and another thread
    sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs
    because oom_kill_task() repeatedly returns 1 and refuses to kill the
    chosen task while select_bad_process() will repeatedly choose the same
    task during the next retry.

    Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
    oom_kill_task() to check for threads sharing the same memory will be
    removed in the next patch in this series where it will no longer be
    necessary.

    Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
    these threads are immune from oom killing already. They simply report an
    oom_adj value of OOM_DISABLE.

    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

13 Jun, 2009

1 commit