12 May, 2010

1 commit

  • Originally, commit d899bf7b ("procfs: provide stack information for
    threads") attempted to introduce a new feature for showing where the
    thread stack was located and how many pages are being utilized by the
    stack.

    Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
    applied to fix the NO_MMU case.

    Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
    64-bit") was applied to fix a bug in ia32 executables being loaded.

    Commit 9ebd4eba7 ("procfs: fix /proc//stat stack pointer for kernel
    threads") was applied to fix a bug which had kernel threads printing a
    userland stack address.

    Commit 1306d603f ('proc: partially revert "procfs: provide stack
    information for threads"') was then applied to revert the stack pages
    being used to solve a significant performance regression.

    This patch nearly undoes the effect of all these patches.

    The reason for reverting these is that they provide an unusable value in
    field 28. For x86_64, a fork will result in the task->stack_start
    value being updated to the current user top of stack and not the stack
    start address. This unpredictability of the stack_start value makes
    it worthless. That includes the intended use of showing how much stack
    space a thread has.

    Other architectures will get different values. As an example, ia64
    gets 0. The do_fork() and copy_process() functions appear to treat the
    stack_start and stack_size parameters as architecture specific.

    I only partially reverted c44972f1 ("procfs: disable per-task stack usage
    on NOMMU"). If I had completely reverted it, I would have had to change
    mm/Makefile to only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
    configured. Since I could not test the builds without significant effort,
    I decided to not change mm/Makefile.

    I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
    information for threads on 64-bit"). I left the KSTK_ESP() change in
    place as that seemed worthwhile.

    Signed-off-by: Robin Holt
    Cc: Stefani Seibold
    Cc: KOSAKI Motohiro
    Cc: Michal Simek
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

07 Mar, 2010

7 commits

  • Modify uid check in do_coredump so as to not apply it in the case of
    pipes.

    This just got noticed in testing. The end of do_coredump validates the
    uid of the inode for the created file against the uid of the crashing
    process to ensure that no one can pre-create a core file with different
    ownership and grab the information contained in the core when they
    shouldn't be able to. This causes failures when using pipes for core
    dumps if the crashing process is not root, which is the uid of the pipe
    when it is created.

    The fix is simple. Since the check for matching uids isn't relevant for
    pipes (a process can't create a pipe that the usermodehelper code will open
    anyway), we can just skip it in the event ispipe is non-zero.

    Reverts a pipe-affecting change which was accidentally made in

    : commit c46f739dd39db3b07ab5deb4e3ec81e1c04a91af
    : Author: Ingo Molnar
    : AuthorDate: Wed Nov 28 13:59:18 2007 +0100
    : Commit: Linus Torvalds
    : CommitDate: Wed Nov 28 10:58:01 2007 -0800
    :
    : vfs: coredumping fix

    Signed-off-by: Neil Horman
    Cc: Andi Kleen
    Cc: Oleg Nesterov
    Cc: Alan Cox
    Cc: Al Viro
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • User visible change.

    do_coredump() kills all threads which share the same ->mm but only the
    coredumping process gets the proper exit_code. Other tasks which share
    the same ->mm die "silently" and return status == 0 to parent.

    This is historical behaviour, not actually a bug. But I think Frank
    Heckenbach rightly dislikes the current behaviour. Simple test-case:

    #include <stdio.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/wait.h>

    int main(void)
    {
        int stat;

        if (!fork()) {
            if (!vfork())
                kill(getpid(), SIGQUIT);
        }

        wait(&stat);
        printf("stat=%x\n", stat);
        return 0;
    }

    Before this patch it prints "stat=0" despite the fact the child was killed
    by SIGQUIT. After this patch the output is "stat=3" which obviously makes
    more sense.

    Even with this patch, only the task which originates the coredumping gets
    "|= 0x80" if the core was actually dumped, but at least the coredumping
    signal is visible to do_wait/etc.

    Reported-by: Frank Heckenbach
    Signed-off-by: Oleg Nesterov
    Acked-by: WANG Cong
    Cc: Roland McGrath
    Cc: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Pass mm->flags as a coredump parameter for consistency.

    ---
    if (mm->core_state || !get_dumpable(mm)) {    <- (1)
        up_write(&mm->mmap_sem);
        put_cred(cred);
        goto fail;
    }

    [...]

    if (get_dumpable(mm) == 2) {    /* Setuid core dump mode */    <- (2)
        flag = O_EXCL;              /* Stop rewrite attacks */
        cred->fsuid = 0;            /* Dump root private */
    }
    ---

    Since dumpable bits are not protected by lock, there is a chance to change
    these bits between (1) and (2).

    To solve this issue, this patch copies mm->flags to
    coredump_params.mm_flags at the beginning of do_coredump() and uses it
    instead of get_dumpable() while dumping core.

    This copy is also passed to binfmt->core_dump, since elf*_core_dump() uses
    dump_filter bits in mm->flags.

    [akpm@linux-foundation.org: fix merge]
    Signed-off-by: Masami Hiramatsu
    Acked-by: Roland McGrath
    Cc: Hidehiro Kawai
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masami Hiramatsu
     
  • Currently we create the initial stack based on the PAGE_SIZE. This is
    unnecessary.

    This patch creates the initial stack independent of the PAGE_SIZE.

    It also bumps up the number of 4k pages allocated from 20 to 32, to
    align with 64K page systems.

    Signed-off-by: Michael Neuling
    Cc: Helge Deller
    Reviewed-by: KOSAKI Motohiro
    Cc: Americo Wang
    Cc: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Neuling
     
  • Make sure the compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
    add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach into the tens of
    thousands. Real workloads are still a factor of 10 less process intensive
    than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Per-mm statistics are shared among all threads of a process, so
    updating them can be a cache-miss point in the page fault path.

    This patch adds a per-thread cache for mm_counter. RSS values are
    counted in a struct in task_struct and synchronized with the mm's
    counters at certain events.

    In this patch, the event is the number of calls to handle_mm_fault.
    The per-thread value is added to the mm every 64 calls.

    A rough estimate from a small benchmark with 2 parallel threads shows:
    [before]
    4.5 cache-miss/faults
    [after]
    4.0 cache-miss/faults
    Anyway, the most contended object is mmap_sem if the number of threads grows.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

23 Feb, 2010

1 commit

  • 803bf5ec259941936262d10ecc84511b76a20921 ("fs/exec.c: restrict initial
    stack space expansion to rlimit") attempts to limit the initial stack to
    20*PAGE_SIZE. Unfortunately, in attempting to ensure the stack is not
    reduced in size, we ended up not changing the stack at all.

    This size reduction check is not necessary as the expand_stack call does
    this already.

    This caused a regression in UML resulting in most guest processes being
    killed.

    Signed-off-by: Michael Neuling
    Reviewed-by: KOSAKI Motohiro
    Acked-by: WANG Cong
    Cc: Anton Blanchard
    Cc: Oleg Nesterov
    Cc: James Morris
    Cc: Serge Hallyn
    Cc: Benjamin Herrenschmidt
    Cc: Jouni Malinen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Neuling
     

12 Feb, 2010

1 commit

  • When reserving stack space for a new process, make sure we're not
    attempting to expand the stack by more than rlimit allows.

    This fixes a bug caused by b6a2fea39318e43fee84fa7b0b90d68bed92d2ba ("mm:
    variable length argument support") and unmasked by
    fc63cf237078c86214abcb2ee9926d8ad289da9b ("exec: setup_arg_pages() fails
    to return errors").

    This bug means that when limiting the stack to less than 20*PAGE_SIZE (e.g.
    80K on 4K pages or 'ulimit -s 79') all processes will be killed before
    they start. This is particularly bad with 64K pages, where a ulimit below
    1280K will kill every process.

    To test, do:

    'ulimit -s 15; ls'

    before and after the patch is applied. Before it's applied, 'ls' should
    be killed. After the patch is applied, 'ls' should no longer be killed.

    A stack limit of 15KB is used since it's small enough to trigger the
    20*PAGE_SIZE case. Also, 15KB is not a multiple of PAGE_SIZE, which is a
    trickier case to handle correctly with this code.

    4K pages should be fine to test with.

    [kosaki.motohiro@jp.fujitsu.com: cleanup]
    [akpm@linux-foundation.org: cleanup cleanup]
    Signed-off-by: Michael Neuling
    Signed-off-by: KOSAKI Motohiro
    Cc: Americo Wang
    Cc: Anton Blanchard
    Cc: Oleg Nesterov
    Cc: James Morris
    Cc: Ingo Molnar
    Cc: Serge Hallyn
    Cc: Benjamin Herrenschmidt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Neuling
     

03 Feb, 2010

1 commit

  • Commit 221af7f87b9 ("Split 'flush_old_exec' into two functions") split
    the function at the point of no return - ie right where there were no
    more error cases to check. That made sense from a technical standpoint,
    but when we then also combined it with the actual personality setting
    going in between flush_old_exec() and setup_new_exec(), it needs to be a
    bit more careful.

    In particular, we need to make sure that we really flush the old
    personality bits in the 'flush' stage, rather than later in the 'setup'
    stage, since otherwise we might be flushing the _new_ personality state
    that we're just setting up.

    So this moves the flags and personality flushing (and 'flush_thread()',
    which is the arch-specific function that generally resets lazy FP state
    etc) of the old process into flush_old_exec(), so that it doesn't affect
    any state that execve() is setting up for the new process environment.

    This was reported by Michal Simek as breaking his Microblaze qemu
    environment.

    Reported-and-tested-by: Michal Simek
    Cc: Peter Anvin
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 Jan, 2010

1 commit

  • 'flush_old_exec()' is the point of no return when doing an execve(), and
    it is pretty badly misnamed. It doesn't just flush the old executable
    environment, it also starts up the new one.

    Which is very inconvenient for things like setting up the new
    personality, because we want the new personality to affect the starting
    of the new environment, but at the same time we do _not_ want the new
    personality to take effect if flushing the old one fails.

    As a result, the x86-64 '32-bit' personality is actually done using this
    insane "I'm going to change the ABI, but I haven't done it yet" bit
    (TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
    personality, but just the "pending" bit, so that "flush_thread()" can do
    the actual personality magic.

    This patch in no way changes any of that insanity, but it does split the
    'flush_old_exec()' function up into a preparatory part that can fail
    (still called flush_old_exec()), and a new part that will actually set
    up the new exec environment (setup_new_exec()). All callers are changed
    to trivially comply with the new world order.

    Signed-off-by: H. Peter Anvin
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

18 Dec, 2009

2 commits

  • Introduce coredump parameter data structure (struct coredump_params) to
    simplify binfmt->core_dump() arguments.

    Signed-off-by: Masami Hiramatsu
    Suggested-by: Ingo Molnar
    Cc: Hidehiro Kawai
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masami Hiramatsu
     
  • Thanks to Roland who pointed out de_thread() issues.

    Currently we add sub-threads to ->real_parent->children list. This buys
    nothing but slows down do_wait().

    With this patch ->children contains only main threads (group leaders).
    The only complication is that forget_original_parent() should iterate over
    sub-threads by hand, and de_thread() needs another list_replace() when it
    changes ->group_leader.

    Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
    tasks, we can remove this check (and we can unify do_wait_thread() and
    ptrace_do_wait()).

    This change can confuse the optimistic search in mm_update_next_owner(),
    but this is fixable and minor.

    Perhaps badness() and oom_kill_process() should be updated, but they
    should be fixed in any case.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

16 Dec, 2009

1 commit

  • Setting a thread's comm to be something unique is a very useful ability
    and is helpful for debugging complicated threaded applications. However
    currently the only way to set a thread name is for the thread to name
    itself via the PR_SET_NAME prctl.

    However, there may be situations where it would be advantageous for a
    thread dispatcher to be naming the threads it's managing, rather than
    having the threads describe themselves. This sort of behavior is
    available on other systems via the pthread_setname_np() interface.

    This patch exports a task's comm via proc/pid/comm and
    proc/pid/task/tid/comm interfaces, and allows thread siblings to write to
    these values.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: John Stultz
    Cc: Andi Kleen
    Cc: Arjan van de Ven
    Cc: Mike Fulton
    Cc: Sean Foley
    Cc: Darren Hart
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
     

03 Dec, 2009

1 commit


12 Nov, 2009

1 commit

  • In setup_arg_pages we work hard to assign a value to ret, but on exit we
    always return 0.

    Also remove a now duplicated exit path and branch to out_unlock instead.

    Signed-off-by: Anton Blanchard
    Acked-by: Serge Hallyn
    Reviewed-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     

25 Oct, 2009

1 commit


24 Sep, 2009

5 commits

  • Because the binfmt does not differ between threads in the same process,
    it can be moved from task_struct to mm_struct. The binfmt module is then
    handled per mm_struct instead of per task_struct.

    Signed-off-by: Hiroshi Shimamoto
    Acked-by: Oleg Nesterov
    Cc: Rusty Russell
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hiroshi Shimamoto
     
  • sys_delete_module() can set MODULE_STATE_GOING after
    search_binary_handler() does try_module_get(). In this case
    set_binfmt()->try_module_get() fails but since none of the callers
    check the returned error, the task will run with the wrong old
    ->binfmt.

    The proper fix should change all ->load_binary() methods, but we can
    rely on the fact that the caller must hold a reference to binfmt->module
    and use __module_get() which never fails.

    Signed-off-by: Oleg Nesterov
    Acked-by: Rusty Russell
    Cc: Hiroshi Shimamoto
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Allow core_pattern pipes to wait for user space to complete

    One of the things that user space processes like to do is look at metadata
    for a crashing process in their /proc/ directory. This is racy,
    however, since do_coredump in the kernel doesn't wait for the user space
    process to complete before it reaps the crashing process. This patch
    corrects that, allowing the kernel to wait for the user space process to
    complete before cleaning up the crashing process. This is a bit tricky to
    do for a few reasons:

    1) The user space process isn't our child, so we can't sys_wait4 on it
    2) We need to close the pipe before waiting for the user process to complete,
    since the user process may rely on an EOF condition

    I've discussed several solutions with Oleg Nesterov off-list about this,
    and this is the one we've come up with. We add ourselves as a pipe reader
    (to prevent premature cleanup of the pipe_inode_info), and remove
    ourselves as a writer (to provide an EOF condition to the writer in user
    space), then we iterate until the user space process exits (which we
    detect by pipe->readers == 1, hence the > 1 check in the loop). When we
    exit the loop, we restore the proper reader/writer values, then we return
    and let filp_close in do_coredump clean up the pipe data properly.

    Signed-off-by: Neil Horman
    Reported-by: Earl Chew
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • Introduce core pipe limiting sysctl.

    Since we can dump cores to pipe, rather than directly to the filesystem,
    we create a condition in which a user can create a very high load on the
    system simply by running bad applications.

    If the pipe reader specified in core_pattern is poorly written, we can
    have lots of outstanding resources and processes in the system.

    This sysctl introduces an ability to limit that resource consumption.
    core_pipe_limit defines how many in-flight dumps may be run in parallel;
    dumps beyond this value are skipped and a note is made in the kernel log.
    A special value of 0 in core_pipe_limit denotes unlimited core dumps may
    be handled (this is the default value).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Neil Horman
    Reported-by: Earl Chew
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • Change how we detect recursive dumps.

    Currently we have a mechanism by which we try to compare pathnames of the
    crashing process to the core_pattern path. This is broken for a dozen
    reasons, and just doesn't work in any sort of robust way.

    I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper apps
    set RLIMIT_CORE to zero, we don't write out core files for any process
    with that particular limit set. If the core_pattern is a pipe, any
    non-zero limit is translated to RLIM_INFINITY.

    This allows complete dumps to be captured, but prevents infinite recursion
    in the event that the core_pattern process itself crashes.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Neil Horman
    Reported-by: Earl Chew
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     

23 Sep, 2009

2 commits

  • A patch to give a better overview of the userland application stack usage,
    especially for embedded linux.

    Currently you are only able to dump the main process/thread stack usage
    which is shown in /proc/pid/status by the "VmStk" value. But you get no
    information about the consumed stack memory of the threads.

    There is an enhancement in the /proc//{task/*,}/*maps which marks
    the vm mapping where the thread stack pointer resides with "[thread stack
    xxxxxxxx]". xxxxxxxx is the maximum size of the stack. This is valuable
    information, because libpthread doesn't set the start of the stack to the
    top of the mapped area, depending on the pthread usage.

    A sample output of /proc//task//maps looks like:

    08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
    08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
    0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
    a7d12000-a7d13000 ---p 00000000 00:00 0
    a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
    a7f13000-a7f14000 ---p 00000000 00:00 0
    a7f14000-a7f36000 rw-p 00000000 00:00 0
    a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
    a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
    a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
    a806c000-a806f000 rw-p 00000000 00:00 0
    a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
    a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
    a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
    a8085000-a8088000 rw-p 00000000 00:00 0
    a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
    a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
    a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
    afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
    ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

    Also there is a new entry "stack usage" in /proc//{task/*,}/status
    which will give you the current stack usage in kB.

    A sample output of /proc/self/status looks like:

    Name: cat
    State: R (running)
    Tgid: 507
    Pid: 507
    .
    .
    .
    CapBnd: fffffffffffffeff
    voluntary_ctxt_switches: 0
    nonvoluntary_ctxt_switches: 0
    Stack usage: 12 kB

    I also fixed stack base address in /proc//{task/*,}/stat to the base
    address of the associated thread stack and not the one of the main
    process. This makes more sense.

    [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
    Signed-off-by: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     
  • Fill the ->ru_maxrss value in struct rusage according to the rss hiwater
    mark. This struct is filled as a parameter to the getrusage syscall.
    The ->ru_maxrss value is set in KBs, which is the way it is done in BSD
    systems. The /usr/bin/time (GNU time) application converts ->ru_maxrss to
    KBs, which seems to be incorrect behavior. The maintainer of this util was
    notified by me with the patch which corrects it and cc'ed.

    To make this happen we extend struct signal_struct by two fields. The
    first one is ->maxrss, which we use to store the rss hiwater of the task.
    The second one is ->cmaxrss, which we use to store the highest rss hiwater
    of all the task's children. These values are used in k_getrusage() to
    actually fill ->ru_maxrss. k_getrusage() uses the current rss hiwater value
    directly if the mm struct exists.

    Note:
    exec() clears mm->hiwater_rss, but doesn't clear sig->maxrss.
    This is intentional behavior. *BSD getrusage has exec() inheriting.

    test programs
    ========================================================

    getrusage.c
    ===========
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <sys/wait.h>

    #include "common.h"

    #define err(str) perror(str), exit(1)

    int main(int argc, char** argv)
    {
        int status;

        printf("allocate 100MB\n");
        consume(100);

        printf("testcase1: fork inherit? \n");
        printf(" expect: initial.self ~= child.self\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
        } else {
            show_rusage("fork child");
            _exit(0);
        }
        printf("\n");

        printf("testcase2: fork inherit? (cont.) \n");
        printf(" expect: initial.children ~= 100MB, but child.children = 0\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
        } else {
            show_rusage("child");
            _exit(0);
        }
        printf("\n");

        printf("testcase3: fork + malloc \n");
        printf(" expect: child.self ~= initial.self + 50MB\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
        } else {
            printf("allocate +50MB\n");
            consume(50);
            show_rusage("fork child");
            _exit(0);
        }
        printf("\n");

        printf("testcase4: grandchild maxrss\n");
        printf(" expect: post_wait.children ~= 300MB\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
            show_rusage("post_wait");
        } else {
            system("./child -n 0 -g 300");
            _exit(0);
        }
        printf("\n");

        printf("testcase5: zombie\n");
        printf(" expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
        printf(" post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
        show_rusage("initial");
        if (__fork()) {
            sleep(1); /* children become zombie */
            show_rusage("pre_wait");
            wait(&status);
            show_rusage("post_wait");
        } else {
            system("./child -n 400");
            _exit(0);
        }
        printf("\n");

        printf("testcase6: SIG_IGN\n");
        printf(" expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
        show_rusage("initial");
        signal(SIGCHLD, SIG_IGN);
        if (__fork()) {
            sleep(1); /* children become zombie */
            show_rusage("after_zombie");
        } else {
            system("./child -n 500");
            _exit(0);
        }
        printf("\n");
        signal(SIGCHLD, SIG_DFL);

        printf("testcase7: exec (without fork) \n");
        printf(" expect: initial ~= exec \n");
        show_rusage("initial");
        execl("./child", "child", "-v", NULL);

        return 0;
    }

    child.c
    =======
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #include "common.h"

    int main(int argc, char** argv)
    {
        int status;
        int c;
        long consume_size = 0;
        long grandchild_consume_size = 0;
        int show = 0;

        while ((c = getopt(argc, argv, "n:g:v")) != -1) {
            switch (c) {
            case 'n':
                consume_size = atol(optarg);
                break;
            case 'v':
                show = 1;
                break;
            case 'g':
                grandchild_consume_size = atol(optarg);
                break;
            default:
                break;
            }
        }

        if (show)
            show_rusage("exec");

        if (consume_size) {
            printf("child alloc %ldMB\n", consume_size);
            consume(consume_size);
        }

        if (grandchild_consume_size) {
            if (fork()) {
                wait(&status);
            } else {
                printf("grandchild alloc %ldMB\n", grandchild_consume_size);
                consume(grandchild_consume_size);

                exit(0);
            }
        }

        return 0;
    }

    common.c
    ========
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    #include "common.h"
    #define err(str) perror(str), exit(1)

    void show_rusage(char *prefix)
    {
        int err, err2;
        struct rusage rusage_self;
        struct rusage rusage_children;

        printf("%s: ", prefix);
        err = getrusage(RUSAGE_SELF, &rusage_self);
        if (!err)
            printf("self %ld ", rusage_self.ru_maxrss);
        err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
        if (!err2)
            printf("children %ld ", rusage_children.ru_maxrss);

        printf("\n");
    }

    /* Some buggy OS need this worthless CPU waste. */
    void make_pagefault(void)
    {
        void *addr;
        int size = getpagesize();
        int i;

        for (i=0; i
    Signed-off-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Pirko
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has outgrown its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring, analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in all, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names. (in an ABI compatible fashion)

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks go to Stephane Eranian, who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility are not affected - this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
      -e 's/PERF_EVENT_/PERF_RECORD_/g' \
      -e 's/PERF_COUNTER/PERF_EVENT/g' \
      -e 's/perf_counter/perf_event/g' \
      -e 's/nb_counters/nb_events/g' \
      -e 's/swcounter/swevent/g' \
      -e 's/tpcounter_event/tp_event/g' \
      $FILES

    for N in $(find . -name perf_counter.[ch]); do
      M=$(echo $N | sed 's/perf_counter/perf_event/g')
      mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
      -e 's/COUNTER_MASK/REG_MASK/g' \
      -e 's/COUNTER/EVENT/g' \
      -e 's/\<event\>/event_id/g' \
      -e 's/counter/event/g' \
      -e 's/Counter/Event/g' \
      $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point in time where the amount of pending patches
    is the smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

06 Sep, 2009

1 commit

  • Tom Horsley reports that his debugger hangs when it tries to read
    /proc/pid_of_tracee/maps. This has happened since the

    "mm_for_maps: take ->cred_guard_mutex to fix the race with exec"
    04b836cbf19e885f8366bccb2e4b0474346c02d

    commit in 2.6.31.

    But the root of the problem lies in the fact that do_execve() path calls
    tracehook_report_exec() which can stop if the tracer sets PT_TRACE_EXEC.

    The tracee must not sleep in TASK_TRACED holding this mutex. Even if we
    remove ->cred_guard_mutex from mm_for_maps() and proc_pid_attr_write(),
    another task doing PTRACE_ATTACH should not hang until it is killed or the
    tracee resumes.

    With this patch do_execve() does not use ->cred_guard_mutex directly and
    we do not hold it throughout, instead:

    - introduce prepare_bprm_creds() helper, it locks the mutex
    and calls prepare_exec_creds() to initialize bprm->cred.

    - install_exec_creds() drops the mutex after commit_creds(),
    and thus before tracehook_report_exec()->ptrace_stop().

    or, if exec fails,

    free_bprm() drops this mutex when bprm->cred != NULL which
    indicates install_exec_creds() was not called.

    Reported-by: Tom Horsley
    Signed-off-by: Oleg Nesterov
    Acked-by: David Howells
    Cc: Roland McGrath
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
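
The locking rule described above (take the mutex in the prepare step, drop it either right after the credentials are committed or in the cleanup path when they never were) can be sketched in user space. This is only a toy model: `toy_mutex`, `toy_bprm` and every `toy_*` function are hypothetical names invented for illustration, not the kernel's code, and the "mutex" is a plain flag rather than a real lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for ->cred_guard_mutex and struct linux_binprm. */
struct toy_mutex { bool locked; };
struct toy_bprm {
    struct toy_mutex *guard;   /* the mutex taken by the prepare step */
    void *cred;                /* non-NULL until credentials are installed */
};

int toy_cred_token;            /* dummy object standing in for new credentials */

void toy_lock(struct toy_mutex *m)   { assert(!m->locked); m->locked = true; }
void toy_unlock(struct toy_mutex *m) { assert(m->locked);  m->locked = false; }

/* Models prepare_bprm_creds(): lock, then set up bprm->cred. */
void toy_prepare_bprm_creds(struct toy_bprm *b, struct toy_mutex *m)
{
    toy_lock(m);
    b->guard = m;
    b->cred = &toy_cred_token;
}

/* Models install_exec_creds(): commit, then drop the mutex at once. */
void toy_install_exec_creds(struct toy_bprm *b)
{
    b->cred = NULL;            /* credentials are now owned by the task */
    toy_unlock(b->guard);
}

/* Models free_bprm(): a non-NULL cred means install never ran. */
void toy_free_bprm(struct toy_bprm *b)
{
    if (b->cred != NULL) {
        toy_unlock(b->guard);
        b->cred = NULL;
    }
}

/*
 * Runs one toy exec. On success, *held_at_stop records whether the mutex
 * was still held at the point where a traced task could stop.
 * Returns true if the exec "succeeded".
 */
bool toy_exec(struct toy_mutex *m, bool fail_early, bool *held_at_stop)
{
    struct toy_bprm b = { NULL, NULL };

    toy_prepare_bprm_creds(&b, m);
    if (!fail_early) {
        toy_install_exec_creds(&b);
        *held_at_stop = m->locked;   /* tracer stop would happen here */
    }
    toy_free_bprm(&b);
    return !fail_early;
}
```

In this model the mutex is released on both the success and the failure path, and it is never held at the simulated stop point, which is the property the commit message is after.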
     

24 Aug, 2009

1 commit

  • vfs_read() offset is defined as loff_t, but kernel_read()
    offset is only defined as unsigned long. Redefine
    kernel_read() offset as loff_t.

    Cc: stable@kernel.org
    Signed-off-by: Mimi Zohar
    Signed-off-by: James Morris

    Mimi Zohar
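
To see why the narrower offset type matters, the following user-space sketch models what happens to a large file offset when it passes through a 32-bit parameter. It assumes a platform where `unsigned long` is 32 bits wide (as on 32-bit architectures); `narrow_off_t`, `wide_off_t` and the function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a 32-bit 'unsigned long' offset parameter. */
typedef uint32_t narrow_off_t;
/* Hypothetical stand-in for the 64-bit loff_t. */
typedef int64_t wide_off_t;

/* Models the old prototype: the offset is narrowed on the way in. */
wide_off_t offset_seen_by_old_api(wide_off_t requested)
{
    narrow_off_t narrowed = (narrow_off_t)requested;  /* high 32 bits lost */
    return (wide_off_t)narrowed;
}

/* Models the new prototype: the offset is passed through unchanged. */
wide_off_t offset_seen_by_new_api(wide_off_t requested)
{
    return requested;
}

/* Prints both views of an offset; returns 1 if the old API misreads it. */
int report_offset(wide_off_t requested)
{
    wide_off_t old_view, new_view;

    assert(requested >= 0);
    old_view = offset_seen_by_old_api(requested);
    new_view = offset_seen_by_new_api(requested);
    printf("requested %lld, old sees %lld, new sees %lld\n",
           (long long)requested, (long long)old_view, (long long)new_view);
    return old_view != new_view;
}
```

Any offset at or beyond 4 GiB wraps around in the narrow version, so a read intended for the far end of a large file would silently land near its start.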
     

07 Jul, 2009

1 commit

  • do_execve() and ptrace_attach() return -EINTR if
    mutex_lock_interruptible(->cred_guard_mutex) fails.

    This is not right; change the code to return -ERESTARTNOINTR instead.

    Perhaps we should also change proc_pid_attr_write().

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Acked-by: Roland McGrath
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
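
The difference between the two return codes can be illustrated with a small user-space model: a restart code is consumed by the entry path, which re-runs the call, while -EINTR is handed straight back to the caller. Everything here is a hypothetical sketch; `TOY_ERESTARTNOINTR`, `toy_call` and the `toy_*` functions are invented for illustration and only mimic the idea, not the kernel's signal handling.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative constant for this sketch; it is not visible to user space. */
#define TOY_ERESTARTNOINTR 513

struct toy_call {
    int interruptions_left;   /* how many times the lock wait is interrupted */
    int attempts;             /* how many times the body has run */
};

/* Models the syscall body: fails with 'code' while interruptions remain. */
int toy_syscall_body(struct toy_call *c, int code)
{
    c->attempts++;
    if (c->interruptions_left > 0) {
        c->interruptions_left--;
        return -code;
    }
    return 0;
}

/*
 * Models the entry path: a restart code causes the body to be run again,
 * so the caller never sees it. Any other error is returned as-is.
 */
int toy_syscall_entry(struct toy_call *c, int code)
{
    int ret;

    do {
        ret = toy_syscall_body(c, code);
    } while (ret == -TOY_ERESTARTNOINTR);

    assert(ret != -TOY_ERESTARTNOINTR);
    return ret;
}
```

With `EINTR` as the failure code the caller gets an error after the first attempt and has to retry by hand; with the restart code the same interruption is invisible to the caller.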
     

12 Jun, 2009

1 commit

  • …el/git/tip/linux-2.6-tip

    * 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
    perf_counter: Turn off by default
    perf_counter: Add counter->id to the throttle event
    perf_counter: Better align code
    perf_counter: Rename L2 to LL cache
    perf_counter: Standardize event names
    perf_counter: Rename enums
    perf_counter tools: Clean up u64 usage
    perf_counter: Rename perf_counter_limit sysctl
    perf_counter: More paranoia settings
    perf_counter: powerpc: Implement generalized cache events for POWER processors
    perf_counters: powerpc: Add support for POWER7 processors
    perf_counter: Accurate period data
    perf_counter: Introduce struct for sample data
    perf_counter tools: Normalize data using per sample period data
    perf_counter: Annotate exit ctx recursion
    perf_counter tools: Propagate signals properly
    perf_counter tools: Small frequency related fixes
    perf_counter: More aggressive frequency adjustment
    perf_counter/x86: Fix the model number of Intel Core2 processors
    perf_counter, x86: Correct some event and umask values for Intel processors
    ...

    Linus Torvalds
     

22 May, 2009

2 commits

  • Conflicts:
    fs/exec.c

    Removed IMA changes (the IMA checks are now performed via may_open()).

    Signed-off-by: James Morris

    James Morris
     
  • - Add support in ima_path_check() for integrity checking without
    incrementing the counts. (Required for nfsd.)
    - rename and export opencount_get to ima_counts_get
    - replace ima_shm_check calls with ima_counts_get
    - export ima_path_check

    Signed-off-by: Mimi Zohar
    Signed-off-by: James Morris

    Mimi Zohar
     

18 May, 2009

1 commit


11 May, 2009

1 commit


09 May, 2009

2 commits


03 May, 2009

1 commit

  • This fixes the problem introduced by commit 3bfacef412 (get rid of
    special-casing the /sbin/loader on alpha): an osf/1 ecoff binary segfaults
    when binfmt_aout is built as a module. That happens because the aout binary
    handler lands on top of the binfmt list due to late registration, and the
    kernel attempts to execute the binary without the preparatory work that
    must be done by binfmt_loader.

    Fixed by changing the registration order of the default binfmt handlers
    using list_add_tail() and introducing an insert_binfmt() function, which
    places a new handler on top of the binfmt list. This might be generally
    useful for installing arch-specific frontends for default handlers or just
    for overriding them.

    Signed-off-by: Ivan Kokshaysky
    Cc: Al Viro
    Cc: Richard Henderson
    Signed-off-by: Linus Torvalds

    Ivan Kokshaysky
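
The ordering change can be sketched with a tiny circular list in user space: default handlers are appended at the tail, while a front-end handler is pushed to the head so it is tried first regardless of when a module registers. The `toy_*` names and `struct toy_fmt` are hypothetical and only model the idea; they are not the kernel's list or binfmt API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical handler node; 'head' is a sentinel with name == NULL. */
struct toy_fmt {
    const char *name;
    struct toy_fmt *prev, *next;
};

void toy_list_init(struct toy_fmt *head)
{
    head->name = NULL;
    head->prev = head;
    head->next = head;
}

/* Links 'fmt' between two adjacent nodes. */
void toy_link(struct toy_fmt *fmt, struct toy_fmt *prev, struct toy_fmt *next)
{
    fmt->prev = prev;
    fmt->next = next;
    prev->next = fmt;
    next->prev = fmt;
}

/* Models default registration: append at the tail of the list. */
void toy_register_binfmt(struct toy_fmt *head, struct toy_fmt *fmt)
{
    toy_link(fmt, head->prev, head);
}

/* Models insert_binfmt(): place the handler at the top of the list. */
void toy_insert_binfmt(struct toy_fmt *head, struct toy_fmt *fmt)
{
    toy_link(fmt, head, head->next);
}

/* Returns the 0-based search position of a handler, or -1 if absent. */
int toy_position(const struct toy_fmt *head, const char *name)
{
    const struct toy_fmt *pos;
    int index = 0;

    assert(name != NULL);
    for (pos = head->next; pos != head; pos = pos->next, index++)
        if (strcmp(pos->name, name) == 0)
            return index;
    return -1;
}
```

In this model a late registration of the aout handler still ends up behind the loader front end, which is the ordering the fix relies on.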
     

29 Apr, 2009

1 commit


24 Apr, 2009

2 commits

  • write_lock(&current->fs->lock) guarantees we can't wrongly miss
    LSM_UNSAFE_SHARE, this is what we care about. Use rcu_read_lock()
    instead of ->siglock to iterate over the sub-threads. We must see
    all CLONE_THREAD|CLONE_FS threads which didn't pass exit_fs(), it
    takes fs->lock too.

    With or without this patch we can miss the freshly cloned thread
    and set LSM_UNSAFE_SHARE, we don't care.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    [ Fixed lock/unlock typo - Hugh ]
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If do_execve() fails after check_unsafe_exec(), it clears fs->in_exec
    unconditionally. This is wrong if we race with our sub-thread which
    also does do_execve:

    Two threads T1 and T2 and another process P, all share the same
    ->fs.

    T1 starts do_execve(BAD_FILE). It calls check_unsafe_exec(), since
    ->fs is shared, we set LSM_UNSAFE_SHARE but not ->in_exec.

    P exits and decrements fs->users.

    T2 starts do_execve(), calls check_unsafe_exec(), now ->fs is not
    shared, we set fs->in_exec.

    T1 continues, open_exec(BAD_FILE) fails, we clear ->in_exec and
    return to the user-space.

    T1 does clone(CLONE_FS /* without CLONE_THREAD */).

    T2 continues without LSM_UNSAFE_SHARE while ->fs is shared with
    another process.

    Change check_unsafe_exec() to return res = 1 if we set ->in_exec, and change
    do_execve() to clear ->in_exec depending on res.

    When do_execve() succeeds, it is safe to clear ->in_exec unconditionally.
    It can be set only if we don't share ->fs with another process, and since
    we already killed all sub-threads either ->in_exec == 0 or we are the
    only user of this ->fs.

    Also, we do not need fs->lock to clear fs->in_exec.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
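
The race and its fix can be replayed in a simplified user-space model. The sketch assumes that "shared" means the fs structure has more users than there are threads in the calling group; `struct toy_fs` and the `toy_*` functions are hypothetical names for illustration and leave out all real locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared fs structure. */
struct toy_fs {
    int users;      /* tasks referencing this structure */
    bool in_exec;   /* set while an unshared exec is in flight */
};

/*
 * Models check_unsafe_exec(): returns 1 if it set in_exec, 0 otherwise.
 * 'group_threads' is the number of users that belong to the caller's own
 * thread group; any extra user means the structure is shared.
 */
int toy_check_unsafe_exec(struct toy_fs *fs, int group_threads,
                          bool *unsafe_share)
{
    assert(fs->users >= 1);
    if (fs->users > group_threads) {
        *unsafe_share = true;
        return 0;
    }
    *unsafe_share = false;
    fs->in_exec = true;
    return 1;
}

/* Old failure path: clears the flag no matter who set it. */
void toy_exec_failed_old(struct toy_fs *fs)
{
    fs->in_exec = false;
}

/* New failure path: clears the flag only if this exec set it. */
void toy_exec_failed(struct toy_fs *fs, int res)
{
    if (res)
        fs->in_exec = false;
}
```

Replaying the scenario (T1 checks while shared, P exits, T2 checks and sets the flag, T1 fails) shows that the conditional clear leaves T2's flag intact, whereas the unconditional clear wipes it.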