18 Dec, 2009

1 commit

  • Thanks to Roland who pointed out de_thread() issues.

    Currently we add sub-threads to ->real_parent->children list. This buys
    nothing but slows down do_wait().

    With this patch ->children contains only main threads (group leaders).
    The only complication is that forget_original_parent() should iterate over
    sub-threads by hand, and de_thread() needs another list_replace() when it
    changes ->group_leader.
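
    A minimal kernel-style sketch of the "iterate over sub-threads by hand" idea
    (illustrative only; the real reparenting loop in forget_original_parent() has
    more work to do per thread, and "reaper" is an assumed local here):

    struct task_struct *p, *t;

    list_for_each_entry(p, &father->children, sibling) {
            t = p;
            do {
                    /* every thread of this group leader gets the new parent */
                    t->real_parent = reaper;
            } while_each_thread(p, t);
    }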

    Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
    tasks, so we can remove this check (and we can unify do_wait_thread() and
    ptrace_do_wait()).

    This change can confuse the optimistic search in mm_update_next_owner(),
    but this is fixable and minor.

    Perhaps badness() and oom_kill_process() should be updated, but they
    should be fixed in any case.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

16 Dec, 2009

2 commits

  • If the tracee calls fork() after PTRACE_SINGLESTEP, the forked child
    starts with the TIF_SINGLESTEP/X86_EFLAGS_TF bits copied from the ptraced
    parent. This is not right, especially when the new child is not
    auto-attached: in this case it is killed by SIGTRAP.

    Change copy_process() to call user_disable_single_step(). Tested on x86.

    Test-case:

    #include <stdio.h>
    #include <unistd.h>
    #include <signal.h>
    #include <assert.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    int main(void)
    {
            int pid, status;

            if (!(pid = fork())) {
                    assert(ptrace(PTRACE_TRACEME) == 0);
                    kill(getpid(), SIGSTOP);

                    if (!fork()) {
                            /* kernel bug: this child will be killed by SIGTRAP */
                            printf("Hello world\n");
                            return 43;
                    }

                    wait(&status);
                    return WEXITSTATUS(status);
            }

            for (;;) {
                    assert(pid == wait(&status));
                    if (WIFEXITED(status))
                            break;
                    assert(ptrace(PTRACE_SINGLESTEP, pid, 0,0) == 0);
            }

            assert(WEXITSTATUS(status) == 43);
            return 0;
    }

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • In a massively parallel environment, res_counter can be a performance
    bottleneck. One strong technique to reduce lock contention is to reduce
    the number of calls by coalescing some amount of calls into one.

    Considering charge/uncharge characteristics:
    - charge is done one by one via demand-paging.
    - uncharge is done
      - in chunks at munmap, truncate, exit, execve...
      - one by one via vmscan/paging.

    It seems we have a chance to coalesce uncharges to improve scalability
    at unmap/truncation.

    This patch is for coalescing uncharge. To avoid scattering memcg's
    structure into functions under /mm, this patch adds memcg batch uncharge
    information to the task. The reason for per-task batching is to make use
    of the caller's context information. We do batched (delayed) uncharge
    when truncation/unmap occurs, but do direct uncharge when uncharge is
    called by memory reclaim (vmscan.c).

    The degree of coalescing depends on the caller:
    - at invalidate/truncate... pagevec size
    - at unmap ... ZAP_BLOCK_SIZE
    (memory itself will be freed at this granularity.)
    So we will not coalesce too much.
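
    A rough sketch of the per-task batch information described above (the field
    names follow the changelog below, but the real layout in the patch may
    differ):

    struct memcg_batch_info {
            int do_batch;                   /* nonzero while a batched section is open */
            struct mem_cgroup *memcg;       /* memcg the batched pages belong to */
            unsigned long bytes;            /* accumulated uncharged memory */
            unsigned long memsw_bytes;      /* accumulated uncharged memory+swap */
    };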

    On an x86-64 8-cpu server, I tested the overhead of memcg at page fault by
    running a program which does map/fault/unmap in a loop, running one task
    per cpu with taskset, and summing the number of page faults in 60 secs.

    [without memcg config]
    40156968 page-faults # 0.085 M/sec ( +- 0.046% )
    27.67 cache-miss/faults
    [root cgroup]
    36659599 page-faults # 0.077 M/sec ( +- 0.247% )
    31.58 miss/faults
    [in a child cgroup]
    18444157 page-faults # 0.039 M/sec ( +- 0.133% )
    69.96 miss/faults
    [child with this patch]
    27133719 page-faults # 0.057 M/sec ( +- 0.155% )
    47.16 miss/faults

    We can see some amount of improvement.
    (The root cgroup is not affected by this patch.)
    Another patch for "charge" will follow this, and the above will be improved
    further.

    Changelog (since 2009/10/02):
    - renamed fields of memcg_batch (pages to bytes, memsw to memsw_bytes)
    - some cleanups and comment/description updates.
    - added initialization code to copy_process(). (possible bug fix)

    Changelog (old):
    - fixed the !CONFIG_MEM_CGROUP case.
    - rebased onto the latest mmotm + softlimit fix patches.
    - unified patch for callers.
    - added comments.
    - made ->do_batch a bool.
    - removed css_get() et al. We don't need it.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

15 Dec, 2009

1 commit


09 Dec, 2009

2 commits

  • * 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: (113 commits)
    cfq-iosched: Do not access cfqq after freeing it
    block: include linux/err.h to use ERR_PTR
    cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit
    blkio: Allow CFQ group IO scheduling even when CFQ is a module
    blkio: Implement dynamic io controlling policy registration
    blkio: Export some symbols from blkio as its user CFQ can be a module
    block: Fix io_context leak after failure of clone with CLONE_IO
    block: Fix io_context leak after clone with CLONE_IO
    cfq-iosched: make nonrot check logic consistent
    io controller: quick fix for blk-cgroup and modular CFQ
    cfq-iosched: move IO controller declerations to a header file
    cfq-iosched: fix compile problem with !CONFIG_CGROUP
    blkio: Documentation
    blkio: Wait on sync-noidle queue even if rq_noidle = 1
    blkio: Implement group_isolation tunable
    blkio: Determine async workload length based on total number of queues
    blkio: Wait for cfq queue to get backlogged if group is empty
    blkio: Propagate cgroup weight updation to cfq groups
    blkio: Drop the reference to queue once the task changes cgroup
    blkio: Provide some isolation between groups
    ...

    Linus Torvalds
     
  • * 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (84 commits)
    KVM: VMX: Fix comparison of guest efer with stale host value
    KVM: s390: Fix prefix register checking in arch/s390/kvm/sigp.c
    KVM: Drop user return notifier when disabling virtualization on a cpu
    KVM: VMX: Disable unrestricted guest when EPT disabled
    KVM: x86 emulator: limit instructions to 15 bytes
    KVM: s390: Make psw available on all exits, not just a subset
    KVM: x86: Add KVM_GET/SET_VCPU_EVENTS
    KVM: VMX: Report unexpected simultaneous exceptions as internal errors
    KVM: Allow internal errors reported to userspace to carry extra data
    KVM: Reorder IOCTLs in main kvm.h
    KVM: x86: Polish exception injection via KVM_SET_GUEST_DEBUG
    KVM: only clear irq_source_id if irqchip is present
    KVM: x86: disallow KVM_{SET,GET}_LAPIC without allocated in-kernel lapic
    KVM: x86: disallow multiple KVM_CREATE_IRQCHIP
    KVM: VMX: Remove vmx->msr_offset_efer
    KVM: MMU: update invlpg handler comment
    KVM: VMX: move CR3/PDPTR update to vmx_set_cr3
    KVM: remove duplicated task_switch check
    KVM: powerpc: Fix BUILD_BUG_ON condition
    KVM: VMX: Use shared msr infrastructure
    ...

    Trivial conflicts due to new Kconfig options in arch/Kconfig and kernel/Makefile

    Linus Torvalds
     

04 Dec, 2009

1 commit

  • With CLONE_IO, the parent's io_context->nr_tasks is incremented, but it is
    never decremented when copy_process() fails afterwards, which prevents
    exit_io_context() from calling the IO schedulers' exit functions.

    Give a task_struct argument to exit_io_context(), and call exit_io_context()
    instead of put_io_context() in the copy_process() cleanup path.
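
    A hedged sketch of the cleanup-path change (the label name is an assumption
    about how the error path is structured):

    bad_fork_cleanup_io:
            if (p->io_context)
                    exit_io_context(p);     /* drops nr_tasks and lets the IO scheduler exit hooks run */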

    Signed-off-by: Louis Rilling
    Signed-off-by: Jens Axboe

    Louis Rilling
     

03 Dec, 2009

3 commits

  • Signed-off-by: Avi Kivity

    Avi Kivity
     
  • This is a real fix for the problem of utime/stime values decreasing,
    described in this thread:

    http://lkml.org/lkml/2009/11/3/522

    Now cputime is accounted in the following way:

    - {u,s}time in task_struct are increased every time the thread
    is interrupted by a tick (timer interrupt).

    - When a thread exits, its {u,s}time are added to signal->{u,s}time,
    after being adjusted by task_times().

    - When all threads in a thread group have exited, the accumulated
    {u,s}time (and also c{u,s}time) in the signal struct are added to
    c{u,s}time in the signal struct of the group's parent.

    So {u,s}time in the task struct are "raw" tick counts, while
    {u,s}time and c{u,s}time in the signal struct are "adjusted" values.

    And the accounted values are used by:

    - task_times(), to get the cputime of a thread:
    This function returns adjusted values that originate from the raw
    {u,s}time, scaled by the sum_exec_runtime accounted by CFS.

    - thread_group_cputime(), to get the cputime of a thread group:
    This function returns the sum of all {u,s}time of living threads in
    the group, plus the {u,s}time in the signal struct, which is the sum of
    adjusted cputimes of all exited threads that belonged to the group.

    The problem is the return value of thread_group_cputime(),
    because it is a mixed sum of "raw" and "adjusted" values:

    group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

    This misbehavior can break {u,s}time monotonicity.
    Assume there is a thread whose raw values are greater than its
    adjusted values (e.g. interrupted by 1000Hz ticks 50 times but
    only running 45ms); when it exits, cputime will decrease (e.g.
    by 5ms).

    To fix this, we could do:

    group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

    But task_times() contains hard divisions, so applying it to
    every thread should be avoided.

    This patch fixes the above problem in the following way:

    - Modify thread exit (= __exit_signal()) not to use task_times().
    This means {u,s}time in the signal struct accumulate raw values instead
    of adjusted values. As a result, thread_group_cputime() returns a
    pure sum of "raw" values.

    - Introduce a new function thread_group_times(*task, *utime, *stime)
    that converts the "raw" values of thread_group_cputime() to "adjusted"
    values, using the same calculation procedure as task_times().

    - Modify group exit (= wait_task_zombie()) to use the newly introduced
    thread_group_times(). This makes c{u,s}time in the signal struct carry
    adjusted values, as before this patch.

    - Replace some thread_group_cputime() calls by thread_group_times().
    These replacements are only applied where the "adjusted" cputime is
    conveyed to users, and where task_times() is already used nearby
    (i.e. sys_times(), getrusage(), and /proc/<pid>/stat).

    This patch has a positive side effect:

    - Before this patch, if a group contained many short-lived threads
    (e.g. running 0.9ms and never interrupted by ticks), the group's
    cputime could be invisible, since each thread's cputime was accumulated
    after being adjusted: imagine the adjustment function as adj(ticks, runtime),
    then {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
    After this patch this no longer happens, because the adjustment is
    applied after accumulation.
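
    A small userspace check of the user-visible guarantee discussed above
    (build with -pthread; burn() is just a helper invented for this demo, it is
    not part of the patch): the group cputime reported by times() must never go
    backwards, even as short-lived threads exit.

    #include <assert.h>
    #include <pthread.h>
    #include <sys/times.h>

    static void *burn(void *arg)
    {
            volatile unsigned long i;

            for (i = 0; i < 10 * 1000 * 1000; i++)
                    ;       /* burn a little CPU, usually less than one tick */
            return NULL;
    }

    int main(void)
    {
            struct tms prev = { 0 }, cur;
            int i;

            for (i = 0; i < 200; i++) {
                    pthread_t t;

                    if (pthread_create(&t, NULL, burn, NULL) == 0)
                            pthread_join(t, NULL);  /* the thread has exited here */
                    times(&cur);
                    assert(cur.tms_utime >= prev.tms_utime);
                    assert(cur.tms_stime >= prev.tms_stime);
                    prev = cur;
            }
            return 0;
    }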

    v2:
    - remove if()s, put new variables into signal_struct.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Spencer Candland
    Cc: Americo Wang
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Stanislaw Gruszka
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     
  • - Remove the if({u,s}t)s because no one calls it with NULL now.
    - Use cputime_{add,sub}().
    - Add ifndef-endif for prev_{u,s}time since they are used
    only when !VIRT_CPU_ACCOUNTING.

    Signed-off-by: Hidetoshi Seto
    Cc: Peter Zijlstra
    Cc: Spencer Candland
    Cc: Americo Wang
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Stanislaw Gruszka
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

30 Nov, 2009

1 commit

  • fork() clones all thread_info flags, including
    TIF_USER_RETURN_NOTIFY; if the new task is first scheduled on a cpu
    which doesn't have user return notifiers set, this causes the user
    return notifier to trigger without any way of clearing itself.

    This is easy to trigger with a fork-heavy workload on the host in
    parallel with kvm, resulting in a cpu stuck in an endless loop on the
    verge of returning to userspace.

    Fix by dropping TIF_USER_RETURN_NOTIFY immediately after fork.
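
    A hedged sketch of the fix (the helper name and exact call site are
    assumptions; the patch may structure this differently):

    /* called from copy_process(), after the child's thread_info flags have
     * been copied from the parent */
    static inline void clear_user_return_notifier(struct task_struct *p)
    {
            clear_tsk_thread_flag(p, TIF_USER_RETURN_NOTIFY);
    }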

    Signed-off-by: Avi Kivity
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Avi Kivity
     

03 Nov, 2009

1 commit

  • nr_processes() returns the sum of the per cpu counter process_counts for
    all online CPUs. This counter is incremented for the current CPU on
    fork() and decremented for the current CPU on exit(). Since a process
    does not necessarily fork and exit on the same CPU the process_count for
    an individual CPU can be either positive or negative and effectively has
    no meaning in isolation.

    Therefore calculating the sum of process_counts over only the online
    CPUs omits the processes which were started or stopped on any CPU which
    has since been unplugged. Only the sum of process_counts across all
    possible CPUs has meaning.
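
    A minimal sketch of the fix this implies (assuming process_counts stays a
    per-cpu counter; the actual patch may differ in detail):

    int nr_processes(void)
    {
            int cpu;
            int total = 0;

            /* sum over every possible CPU, so counts accumulated on CPUs that
             * have since been unplugged are not lost */
            for_each_possible_cpu(cpu)
                    total += per_cpu(process_counts, cpu);

            return total;
    }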

    The only caller of nr_processes() is proc_root_getattr() which
    calculates the number of links to /proc as
    stat->nlink = proc_root.nlink + nr_processes();

    You don't have to be all that unlucky for the nr_processes() to return a
    negative value leading to a negative number of links (or rather, an
    apparently enormous number of links). If this happens then you can get
    failures where things like "ls /proc" start to fail because they got an
    -EOVERFLOW from some stat() call.

    Example with some debugging inserted to show what goes on:
    # ps haux|wc -l
    nr_processes: CPU0: 90
    nr_processes: CPU1: 1030
    nr_processes: CPU2: -900
    nr_processes: CPU3: -136
    nr_processes: TOTAL: 84
    proc_root_getattr. nlink 12 + nr_processes() 84 = 96
    84
    # echo 0 >/sys/devices/system/cpu/cpu1/online
    # ps haux|wc -l
    nr_processes: CPU0: 85
    nr_processes: CPU2: -901
    nr_processes: CPU3: -137
    nr_processes: TOTAL: -953
    proc_root_getattr. nlink 12 + nr_processes() -953 = -941
    75
    # stat /proc/
    nr_processes: CPU0: 84
    nr_processes: CPU2: -901
    nr_processes: CPU3: -137
    nr_processes: TOTAL: -954
    proc_root_getattr. nlink 12 + nr_processes() -954 = -942
    File: `/proc/'
    Size: 0 Blocks: 0 IO Block: 1024 directory
    Device: 3h/3d Inode: 1 Links: 4294966354
    Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
    Access: 2009-11-03 09:06:55.000000000 +0000
    Modify: 2009-11-03 09:06:55.000000000 +0000
    Change: 2009-11-03 09:06:55.000000000 +0000

    I'm not 100% convinced that the per_cpu regions remain valid for offline
    CPUs, although my testing suggests that they do. If not then I think the
    correct solution would be to aggregate the process_count for a given CPU
    into a global base value in cpu_down().

    This bug appears to pre-date the transition to git and it looks like it
    may even have been present in linux-2.6.0-test7-bk3 since it looks like
    the code Rusty patched in http://lwn.net/Articles/64773/ was already
    wrong.

    Signed-off-by: Ian Campbell
    Cc: Andrew Morton
    Cc: Rusty Russell
    Signed-off-by: Linus Torvalds

    Ian Campbell
     

09 Oct, 2009

1 commit

  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    futex: fix requeue_pi key imbalance
    futex: Fix typo in FUTEX_WAIT/WAKE_BITSET_PRIVATE definitions
    rcu: Place root rcu_node structure in separate lockdep class
    rcu: Make hot-unplugged CPU relinquish its own RCU callbacks
    rcu: Move rcu_barrier() to rcutree
    futex: Move exit_pi_state() call to release_mm()
    futex: Nullify robust lists after cleanup
    futex: Fix locking imbalance
    panic: Fix panic message visibility by calling bust_spinlocks(0) before dying
    rcu: Replace the rcu_barrier enum with pointer to call_rcu*() function
    rcu: Clean up code based on review feedback from Josh Triplett, part 4
    rcu: Clean up code based on review feedback from Josh Triplett, part 3
    rcu: Fix rcu_lock_map build failure on CONFIG_PROVE_LOCKING=y
    rcu: Clean up code to address Ingo's checkpatch feedback
    rcu: Clean up code based on review feedback from Josh Triplett, part 2
    rcu: Clean up code based on review feedback from Josh Triplett

    Linus Torvalds
     

06 Oct, 2009

2 commits

  • exit_pi_state() is called from do_exit() but not from do_execve().
    Move it to release_mm() so it gets called from do_execve() as well.

    Signed-off-by: Thomas Gleixner
    LKML-Reference:
    Cc: stable@kernel.org
    Cc: Anirban Sinha
    Cc: Peter Zijlstra

    Thomas Gleixner
     
  • The robust list pointers of user-space-held futexes are kept intact
    over an exec() call. When the exec'ed task exits, exit_robust_list() is
    called with the stale pointer. The risk of corruption is minimal, but
    it is still incorrect to keep the pointers valid. Actually glibc
    should uninstall the robust list before calling exec(), but we have to
    deal with it anyway.

    Nullify the pointers after [compat_]exit_robust_list() has been
    called.
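
    A hedged sketch of the exit-path change (the exact location in the exit
    code may differ): once the robust list has been walked, the stale user
    pointers are dropped so a later exit cannot walk them again.

    if (unlikely(tsk->robust_list)) {
            exit_robust_list(tsk);
            tsk->robust_list = NULL;
    }
    #ifdef CONFIG_COMPAT
    if (unlikely(tsk->compat_robust_list)) {
            compat_exit_robust_list(tsk);
            tsk->compat_robust_list = NULL;
    }
    #endif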

    Reported-by: Anirban Sinha
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Thomas Gleixner
    LKML-Reference:
    Cc: stable@kernel.org

    Peter Zijlstra
     

24 Sep, 2009

4 commits

  • Because the binfmt is not different between threads in the same process,
    it can be moved from the task_struct to the mm_struct. The binfmt module
    is then handled per mm_struct instead of per task_struct.

    Signed-off-by: Hiroshi Shimamoto
    Acked-by: Oleg Nesterov
    Cc: Rusty Russell
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hiroshi Shimamoto
     
  • ->ioctx_lock and ->ioctx_list are used only under CONFIG_AIO.

    Signed-off-by: Alexey Dobriyan
    Cc: Zach Brown
    Cc: Benjamin LaHaise
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • When global or container-init processes use CLONE_PARENT, they create a
    multi-rooted process tree. Besides, siblings of global init remain as
    zombies on exit since they are not reaped by their parent (swapper). So
    prevent global and container-inits from creating siblings.
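
    A hedged sketch of the check in copy_process() (the exact condition in the
    patch may be expressed differently):

    /* an init process (global or per-container) must not be given a
     * sibling via CLONE_PARENT */
    if ((clone_flags & CLONE_PARENT) &&
        (current->signal->flags & SIGNAL_UNKILLABLE))
            return ERR_PTR(-EINVAL);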

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Eric W. Biederman
    Acked-by: Roland McGrath
    Cc: Oren Laadan
    Cc: Oleg Nesterov
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    itimers: Add tracepoints for itimer
    hrtimer: Add tracepoint for hrtimers
    timers: Add tracepoints for timer_list timers
    cputime: Optimize jiffies_to_cputime(1)
    itimers: Simplify arm_timer() code a bit
    itimers: Fix periodic tics precision
    itimers: Merge ITIMER_VIRT and ITIMER_PROF

    Trivial header file include conflicts in kernel/fork.c

    Linus Torvalds
     

23 Sep, 2009

2 commits

  • A patch to give a better overview of the userland application stack usage,
    especially for embedded linux.

    Currently you are only able to dump the main process/thread stack usage,
    which is shown in /proc/<pid>/status as the "VmStk" value. But you get no
    information about the stack memory consumed by the other threads.

    There is an enhancement in /proc/<pid>/{task/*,}/*maps which marks
    the vm mapping where the thread stack pointer resides with "[thread stack
    xxxxxxxx]". xxxxxxxx is the maximum size of the stack. This is informational
    only, because libpthread doesn't set the start of the stack to the top of
    the mapped area, depending on the pthread usage.

    A sample output of /proc/<pid>/task/<tid>/maps looks like:

    08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
    08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
    0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
    a7d12000-a7d13000 ---p 00000000 00:00 0
    a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
    a7f13000-a7f14000 ---p 00000000 00:00 0
    a7f14000-a7f36000 rw-p 00000000 00:00 0
    a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
    a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
    a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
    a806c000-a806f000 rw-p 00000000 00:00 0
    a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
    a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
    a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
    a8085000-a8088000 rw-p 00000000 00:00 0
    a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
    a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
    a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
    afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
    ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

    Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
    which will give you the current stack usage in kB.

    A sample output of /proc/self/status looks like:

    Name: cat
    State: R (running)
    Tgid: 507
    Pid: 507
    .
    .
    .
    CapBnd: fffffffffffffeff
    voluntary_ctxt_switches: 0
    nonvoluntary_ctxt_switches: 0
    Stack usage: 12 kB

    I also fixed the stack base address in /proc/<pid>/{task/*,}/stat to be the
    base address of the associated thread stack and not that of the main
    process. This makes more sense.

    [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
    Signed-off-by: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     
  • Make the ->ru_maxrss value in struct rusage be filled according to the rss
    hiwater mark. This struct is filled as a parameter to the getrusage
    syscall. The ->ru_maxrss value is set in kB, which is the way it is done in
    BSD systems. The /usr/bin/time (GNU time) application converts ->ru_maxrss
    to kB, which seems to be incorrect behavior. The maintainer of this util
    was notified by me with the patch which corrects it and was cc'ed.

    To make this happen we extend struct signal_struct by two fields. The
    first one is ->maxrss, which we use to store the rss hiwater of the task.
    The second one is ->cmaxrss, which we use to store the highest rss hiwater
    of all child tasks. These values are used in k_getrusage() to actually fill
    ->ru_maxrss. k_getrusage() uses the current rss hiwater value directly if
    the mm struct exists.

    Note:
    exec() clears mm->hiwater_rss, but doesn't clear sig->maxrss.
    This is intentional behavior. *BSD getrusage has the same exec()
    inheritance.

    test programs
    ========================================================

    getrusage.c
    ===========
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #include "common.h"

    #define err(str) perror(str), exit(1)

    int main(int argc, char** argv)
    {
    int status;

    printf("allocate 100MB\n");
    consume(100);

    printf("testcase1: fork inherit? \n");
    printf(" expect: initial.self ~= child.self\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    } else {
    show_rusage("fork child");
    _exit(0);
    }
    printf("\n");

    printf("testcase2: fork inherit? (cont.) \n");
    printf(" expect: initial.children ~= 100MB, but child.children = 0\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    } else {
    show_rusage("child");
    _exit(0);
    }
    printf("\n");

    printf("testcase3: fork + malloc \n");
    printf(" expect: child.self ~= initial.self + 50MB\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    } else {
    printf("allocate +50MB\n");
    consume(50);
    show_rusage("fork child");
    _exit(0);
    }
    printf("\n");

    printf("testcase4: grandchild maxrss\n");
    printf(" expect: post_wait.children ~= 300MB\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    show_rusage("post_wait");
    } else {
    system("./child -n 0 -g 300");
    _exit(0);
    }
    printf("\n");

    printf("testcase5: zombie\n");
    printf(" expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
    printf(" post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
    show_rusage("initial");
    if (__fork()) {
    sleep(1); /* children become zombie */
    show_rusage("pre_wait");
    wait(&status);
    show_rusage("post_wait");
    } else {
    system("./child -n 400");
    _exit(0);
    }
    printf("\n");

    printf("testcase6: SIG_IGN\n");
    printf(" expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
    show_rusage("initial");
    signal(SIGCHLD, SIG_IGN);
    if (__fork()) {
    sleep(1); /* children become zombie */
    show_rusage("after_zombie");
    } else {
    system("./child -n 500");
    _exit(0);
    }
    printf("\n");
    signal(SIGCHLD, SIG_DFL);

    printf("testcase7: exec (without fork) \n");
    printf(" expect: initial ~= exec \n");
    show_rusage("initial");
    execl("./child", "child", "-v", NULL);

    return 0;
    }

    child.c
    =======
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #include "common.h"

    int main(int argc, char** argv)
    {
    int status;
    int c;
    long consume_size = 0;
    long grandchild_consume_size = 0;
    int show = 0;

    while ((c = getopt(argc, argv, "n:g:v")) != -1) {
    switch (c) {
    case 'n':
    consume_size = atol(optarg);
    break;
    case 'v':
    show = 1;
    break;
    case 'g':

    grandchild_consume_size = atol(optarg);
    break;
    default:
    break;
    }
    }

    if (show)
    show_rusage("exec");

    if (consume_size) {
    printf("child alloc %ldMB\n", consume_size);
    consume(consume_size);
    }

    if (grandchild_consume_size) {
    if (fork()) {
    wait(&status);
    } else {
    printf("grandchild alloc %ldMB\n", grandchild_consume_size);
    consume(grandchild_consume_size);

    exit(0);
    }
    }

    return 0;
    }

    common.c
    ========
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    #include "common.h"
    #define err(str) perror(str), exit(1)

    void show_rusage(char *prefix)
    {
    int err, err2;
    struct rusage rusage_self;
    struct rusage rusage_children;

    printf("%s: ", prefix);
    err = getrusage(RUSAGE_SELF, &rusage_self);
    if (!err)
    printf("self %ld ", rusage_self.ru_maxrss);
    err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
    if (!err2)
    printf("children %ld ", rusage_children.ru_maxrss);

    printf("\n");
    }

    /* Some buggy OS need this worthless CPU waste. */
    void make_pagefault(void)
    {
    void *addr;
    int size = getpagesize();
    int i;

    for (i=0; i
    Signed-off-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Pirko
     

22 Sep, 2009

5 commits

  • Currently, the OOM logic callflow is:

    __out_of_memory()
        select_bad_process()        for each task
            badness()               calculate badness of one task
                oom_kill_process()  search child
                    oom_kill_task() kill target task and mm shared tasks with it

    For example, process-A has two threads, thread-A and thread-B, it has very
    fat memory, and each thread has the following oom_adj and oom_score:

    thread-A: oom_adj = OOM_DISABLE, oom_score = 0
    thread-B: oom_adj = 0, oom_score = very-high

    Then select_bad_process() selects thread-B, but oom_kill_task() refuses to
    kill the task because thread-A has OOM_DISABLE. Thus __out_of_memory()
    calls select_bad_process() again, but select_bad_process() selects the same
    task. This means the kernel falls into a livelock.

    The fact is, select_bad_process() must select a killable task, otherwise
    the OOM logic goes into a livelock.

    And the root cause is that oom_adj shouldn't be a per-thread value; it
    should be a per-process value, because the OOM killer kills a process, not
    a thread. Thus this patch moves oomkilladj (now more appropriately named
    oom_adj) from struct task_struct to struct signal_struct. This naturally
    prevents select_bad_process() from choosing the wrong task.

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Rawhide users have reported a hang at startup when cryptsetup is run: the
    same problem can be simply reproduced by running a program such as
    int main() { mlockall(MCL_CURRENT | MCL_FUTURE); return 0; }
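
    Spelled out as a compilable reproducer (only the header the one-liner needs
    has been added):

    #include <sys/mman.h>

    int main(void)
    {
            mlockall(MCL_CURRENT | MCL_FUTURE);
            return 0;
    }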

    The problem is that exit_mmap() applies munlock_vma_pages_all() to
    clean up VM_LOCKED areas, and its current implementation (stupidly)
    tries to fault in absent pages, for example where PROT_NONE prevented
    them being faulted in when mlocking. Whereas the "ksm: fix oom
    deadlock" patch, knowing there's a race by which KSM might try to fault
    in pages after exit_mmap() had finally zapped the range, backs out of
    such faults doing nothing when its ksm_test_exit() notices mm_users 0.

    So revert that part of "ksm: fix oom deadlock" which moved the
    ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
    and remove those ksm_test_exit() checks from the page fault paths, so
    allowing the munlocking to proceed without interference.

    ksm_exit, if there are rmap_items still chained on this mm slot, takes
    mmap_sem write side: so preventing KSM from working on an mm while
    exit_mmap runs. And KSM will bail out as soon as it notices that
    mm_users is already zero, thanks to its internal ksm_test_exit checks.
    So that when a task is killed by OOM killer or the user, KSM will not
    indefinitely prevent it from running exit_mmap to release its memory.

    This does break a part of what "ksm: fix oom deadlock" was trying to
    achieve. When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even
    when ksmd itself has to cancel a KSM page, it is possible that the
    first OOM-kill victim would be the KSM process being faulted: then its
    memory won't be freed until a second victim has been selected (freeing
    memory for the unmerging fault to complete).

    But the OOM killer is already liable to kill a second victim once the
    intended victim's p->mm goes to NULL: so there's not much point in
    rejecting this KSM patch before fixing that OOM behaviour. It is very
    much more important to allow KSM users to boot up, than to haggle over
    an unlikely and poorly supported OOM case.

    We also intend to fix munlocking to not fault pages: at which point
    this patch _could_ be reverted; though that would be controversial, so
    we hope to find a better solution.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Justin M. Forbes
    Acked-for-now-by: Hugh Dickins
    Cc: Izik Eidus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • There's a now-obvious deadlock in KSM's out-of-memory handling:
    imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
    trying to allocate a page to break KSM in an mm which becomes the
    OOM victim (quite likely in the unmerge case): it's killed and goes
    to exit, and hangs there waiting to acquire ksm_thread_mutex.

    Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
    though that made everything else: perhaps use mmap_sem somehow?
    And part of the answer lies in the comments on unmerge_ksm_pages:
    __ksm_exit should also leave all the rmap_item removal to ksmd.

    But there's a fundamental problem, that KSM relies upon mmap_sem to
    guarantee the consistency of the mm it's dealing with, yet exit_mmap
    tears down an mm without taking mmap_sem. And bumping mm_users won't
    help at all, that just ensures that the pages the OOM killer assumes
    are on their way to being freed will not be freed.

    The best answer seems to be, to move the ksm_exit callout from just
    before exit_mmap, to the middle of exit_mmap: after the mm's pages
    have been freed (if the mmu_gather is flushed), but before its page
    tables and vma structures have been freed; and down_write,up_write
    mmap_sem there to serialize with KSM's own reliance on mmap_sem.

    But KSM then needs to be careful, whenever it downs mmap_sem, to
    check that the mm is not already exiting: there's a danger of using
    find_vma on a layout that's being torn apart, or writing into page
    tables which have been freed for reuse; and even do_anonymous_page
    and __do_fault need to check they're not being called by break_ksm
    to reinstate a pte after zap_pte_range has zapped that page table.

    Though it might be clearer to add an exiting flag, set while holding
    mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
    a zapped pte. All we need is to check whether mm_users is 0 - but
    must remember that ksmd may detect that before __ksm_exit is reached.
    So, ksm_test_exit(mm) added to comment such checks on mm->mm_users.
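
    A hedged sketch of that helper (the real definition lives with the KSM
    code and may differ in detail):

    /* true once exit_mmap() is under way for this mm: KSM must back out */
    static inline bool ksm_test_exit(struct mm_struct *mm)
    {
            return atomic_read(&mm->mm_users) == 0;
    }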

    __ksm_exit now has to leave clearing up the rmap_items to ksmd,
    which needs ksm_thread_mutex; but it shifts the exiting mm just after the
    ksm_scan cursor so that it will soon be dealt with. __ksm_enter raises
    mm_count to hold the mm_struct, and ksmd's exit processing (exactly like
    its processing when it finds all VM_MERGEABLEs unmapped) mmdrops it;
    a similar procedure applies for KSM_RUN_UNMERGE (which has stopped ksmd).

    But also give __ksm_exit a fast path: when there's no complication
    (no rmap_items attached to mm and it's not at the ksm_scan cursor),
    it can safely do all the exiting work itself. This is not just an
    optimization: when ksmd is not running, the raised mm_count would
    otherwise leak mm_structs.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This patch presents the mm interface to a dummy version of ksm.c, for
    better scrutiny of that interface: the real ksm.c follows later.

    When CONFIG_KSM is not set, madvise(2) rejects MADV_MERGEABLE and
    MADV_UNMERGEABLE with EINVAL, since that seems more helpful than
    pretending that they can be serviced. But when CONFIG_KSM=y, accept them
    even if KSM is not currently running, and even on areas which KSM will not
    touch (e.g. hugetlb or shared file or special driver mappings).

    Like other madvise calls, report ENOMEM despite success if any area in the
    range is unmapped, and use EAGAIN to report out of memory.

    Define vma flag VM_MERGEABLE to identify an area on which KSM may try
    merging pages: leave it to ksm_madvise() to decide whether to set it.
    Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain
    VM_MERGEABLE areas, to minimize callouts when forking or exiting.
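
    A minimal userspace illustration of the interface described above (EINVAL
    is expected when CONFIG_KSM is not set):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 16 * 4096;
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            /* ask KSM to consider this anonymous area for merging */
            if (madvise(p, len, MADV_MERGEABLE) != 0)
                    perror("madvise(MADV_MERGEABLE)");
            return 0;
    }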

    Based upon earlier patches by Chris Wright and Izik Eidus.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Chris Wright
    Signed-off-by: Izik Eidus
    Cc: Michael Kerrisk
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Avi Kivity
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The amount of memory allocated to kernel stacks can become significant and
    cause OOM conditions. However, we do not display the amount of memory
    consumed by stacks.

    Add code to display the amount of memory used for stacks in /proc/meminfo.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has grown out its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring, analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in one, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names. (in an ABI compatible fashion)

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks goes to Stephane Eranian who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility is not affected - this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

    for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\<event\>/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point in time where the amount of pending patches
    is the smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

12 Sep, 2009

1 commit

  • * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (28 commits)
    rcu: Move end of special early-boot RCU operation earlier
    rcu: Changes from reviews: avoid casts, fix/add warnings, improve comments
    rcu: Create rcutree plugins to handle hotplug CPU for multi-level trees
    rcu: Remove lockdep annotations from RCU's _notrace() API members
    rcu: Add #ifdef to suppress __rcu_offline_cpu() warning in !HOTPLUG_CPU builds
    rcu: Add CPU-offline processing for single-node configurations
    rcu: Add "notrace" to RCU function headers used by ftrace
    rcu: Remove CONFIG_PREEMPT_RCU
    rcu: Merge preemptable-RCU functionality into hierarchical RCU
    rcu: Simplify rcu_pending()/rcu_check_callbacks() API
    rcu: Use debugfs_remove_recursive() simplify code.
    rcu: Merge per-RCU-flavor initialization into pre-existing macro
    rcu: Fix online/offline indication for rcudata.csv trace file
    rcu: Consolidate sparse and lockdep declarations in include/linux/rcupdate.h
    rcu: Renamings to increase RCU clarity
    rcu: Move private definitions from include/linux/rcutree.h to kernel/rcutree.h
    rcu: Expunge lingering references to CONFIG_CLASSIC_RCU, optimize on !SMP
    rcu: Delay rcu_barrier() wait until beginning of next CPU-hotunplug operation.
    rcu: Fix typo in rcu_irq_exit() comment header
    rcu: Make rcupreempt_trace.c look at offline CPUs
    ...

    Linus Torvalds
     

11 Sep, 2009

1 commit


04 Sep, 2009

1 commit


02 Sep, 2009

1 commit

  • Add a config option (CONFIG_DEBUG_CREDENTIALS) to turn on some debug checking
    for credential management. The additional code keeps track of the number of
    pointers from task_structs to any given cred struct, and checks to see that
    this number never exceeds the usage count of the cred struct (which includes
    all references, not just those from task_structs).

    Furthermore, if SELinux is enabled, the code also checks that the security
    pointer in the cred struct is never seen to be invalid.

    This attempts to catch the bug whereby inode_has_perm() faults in an nfsd
    kernel thread on seeing cred->security be a NULL pointer (it appears that the
    credential struct has been previously released):

    http://www.kerneloops.org/oops.php?number=252883

    Signed-off-by: David Howells
    Signed-off-by: James Morris

    David Howells
     

29 Aug, 2009

1 commit


27 Aug, 2009

1 commit

  • Spotted by Hiroshi Shimamoto who also provided the test-case below.

    copy_process() uses signal->count as a reference counter, but it is not.
    This test case

    #include <errno.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    void *null_thread(void *p)
    {
            for (;;)
                    sleep(1);

            return NULL;
    }

    void *exec_thread(void *p)
    {
            execl("/bin/true", "/bin/true", NULL);

            return null_thread(p);
    }

    int main(int argc, char **argv)
    {
            for (;;) {
                    pid_t pid;
                    int ret, status;

                    pid = fork();
                    if (pid < 0)
                            break;

                    if (!pid) {
                            pthread_t tid;

                            pthread_create(&tid, NULL, exec_thread, NULL);
                            for (;;)
                                    pthread_create(&tid, NULL, null_thread, NULL);
                    }

                    do {
                            ret = waitpid(pid, &status, 0);
                    } while (ret == -1 && errno == EINTR);
            }

            return 0;
    }

    quickly creates an unkillable task.

    If copy_process(CLONE_THREAD) races with de_thread()
    copy_signal()->atomic(signal->count) breaks the signal->notify_count
    logic, and the execing thread can hang forever in kernel space.

    Change copy_process() to increment count/live only when we know for sure
    we can't fail. In this case the forked thread will take care of its
    reference to signal correctly.

    If copy_process() fails, check the CLONE_THREAD flag. If it is set, do
    nothing: the counters were not changed and current belongs to the same
    thread group. If it is not set, ->signal must be released in any case
    (and ->count must be == 1), since the forked child is the only thread in
    the thread group.

    We need more cleanups here, in particular signal->count should not be used
    by de_thread/__exit_signal at all. This patch only fixes the bug.

    Reported-by: Hiroshi Shimamoto
    Tested-by: Hiroshi Shimamoto
    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: KAMEZAWA Hiroyuki
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

23 Aug, 2009

1 commit

  • Create a kernel/rcutree_plugin.h file that contains definitions
    for preemptable RCU (or, under the #else branch of the #ifdef,
    empty definitions for the classic non-preemptable semantics).
    These definitions fit into plugins defined in kernel/rcutree.c
    for this purpose.

    This variant of preemptable RCU uses a new algorithm whose
    read-side expense is roughly that of classic hierarchical RCU
    under CONFIG_PREEMPT. This new algorithm's update-side expense
    is similar to that of classic hierarchical RCU, and, in absence
    of read-side preemption or blocking, is exactly that of classic
    hierarchical RCU. Perhaps more important, this new algorithm
    has a much simpler implementation, saving well over 1,000 lines
    of code compared to mainline's implementation of preemptable
    RCU, which will hopefully be retired in favor of this new
    algorithm.

    The simplifications are obtained by maintaining per-task
    nesting state for running tasks, and using a simple
    lock-protected algorithm to handle accounting when tasks block
    within RCU read-side critical sections, making use of lessons
    learned while creating numerous user-level RCU implementations
    over the past 18 months.

    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: akpm@linux-foundation.org
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josht@linux.vnet.ibm.com
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

19 Aug, 2009

1 commit

  • Commit 2ff05b2b (oom: move oom_adj value) moved the oom_adj value to
    the mm_struct. It was a very good first step towards sanitizing OOM.

    However, Paul Menage reported that the commit causes a regression in his
    job scheduler: the current OOM logic can kill an OOM_DISABLED process.

    Why? His program has code similar to the following.

    ...
    set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
    ...
    if (vfork() == 0) {
    set_oom_adj(0); /* Invoked child can be killed */
    execve("foo-bar-cmd");
    }
    ....

    A vfork() parent and child share the same mm_struct, so the above
    set_oom_adj(0) doesn't only change oom_adj for the vfork() child, it also
    changes oom_adj for the vfork() parent. The vfork() parent (the job
    scheduler) therefore lost its OOM immunity and was killed.

    Actually, the fork-set-exec idiom is very frequently used in userland
    programs. We must not break this assumption.

    Therefore, this patch reverts commit 2ff05b2b and the related commits.

    Reverted commit list
    ---------------------
    - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct)
    - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
    - commit 8123681022 (oom: only oom kill exiting tasks with attached memory)
    - commit 933b787b57 (mm: copy over oom_adj value at fork time)

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

08 Aug, 2009

1 commit

  • While looking at Jens Rosenboom's bug report
    (http://lkml.org/lkml/2009/7/27/35) about a strange sys_futex call done from
    a dying "ps" program, we found the following problem.

    clone() syscall has special support for TID of created threads. This
    support includes two features.

    One (CLONE_CHILD_SETTID) is to set an integer into user memory with the
    TID value.

    One (CLONE_CHILD_CLEARTID) is to clear this same integer once the created
    thread dies.

    The integer location is a user provided pointer, provided at clone()
    time.

    The kernel keeps this pointer value in current->clear_child_tid.

    At execve() time, we should make sure the kernel doesn't keep this
    user-provided pointer, as the full user memory is replaced by a new one.

    As glibc fork() actually uses the clone() syscall with CLONE_CHILD_SETTID
    and CLONE_CHILD_CLEARTID set, chances are high that we might corrupt user
    memory in forked processes.

    Following sequence could happen:

    1) bash (or any program) starts a new process, by a fork() call that
    glibc maps to a clone( ... CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID
    ...) syscall

    2) When new process starts, its current->clear_child_tid is set to a
    location that has a meaning only in bash (or initial program) context
    (&THREAD_SELF->tid)

    3) This new process does the execve() syscall to start a new program.
    current->clear_child_tid is left unchanged (a non NULL value)

    4) If this new program creates some threads, and initial thread exits,
    kernel will attempt to clear the integer pointed by
    current->clear_child_tid from mm_release() :

    if (tsk->clear_child_tid
        && !(tsk->flags & PF_SIGNALED)
        && atomic_read(&mm->mm_users) > 1) {
            u32 __user * tidptr = tsk->clear_child_tid;
            tsk->clear_child_tid = NULL;

            /*
             * We don't check the error code - if userspace has
             * not set up a proper pointer then tough luck.
             */
    << here >> put_user(0, tidptr);
            sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL, 0);
    }

    5) OR: if the new program is not multi-threaded but is spied on by
    /proc/<pid> users (the ps command for example), mm_users > 1, and the
    exiting program could corrupt 4 bytes in a persistent memory area (shm or
    a memory mapped file).

    If current->clear_child_tid points to a writeable portion of memory of the
    new program, kernel happily and silently corrupts 4 bytes of memory, with
    unexpected effects.

    Fix is straightforward and should not break any sane program.
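
    A hedged sketch of that straightforward fix (the real patch may be
    structured differently): once mm_release() has dealt with the pointer,
    clear it unconditionally, so the value can never survive into an exec'ed
    program.

    if (tsk->clear_child_tid) {
            if (!(tsk->flags & PF_SIGNALED) &&
                atomic_read(&mm->mm_users) > 1) {
                    /* wake any waiters, as before */
                    put_user(0, tsk->clear_child_tid);
                    sys_futex(tsk->clear_child_tid, FUTEX_WAKE,
                              1, NULL, NULL, 0);
            }
            tsk->clear_child_tid = NULL;    /* never leak into the next program */
    }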

    Reported-by: Jens Rosenboom
    Acked-by: Linus Torvalds
    Signed-off-by: Eric Dumazet
    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Sonny Rao
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Ulrich Drepper
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

03 Aug, 2009

1 commit

  • Both cpu itimers have the same data flow in a few places; this
    patch unifies the code related to the VIRT and PROF
    itimers.

    Signed-off-by: Stanislaw Gruszka
    Acked-by: Peter Zijlstra
    Acked-by: Thomas Gleixner
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     

02 Aug, 2009

1 commit

  • In order to be able to distinguish between no samples due to
    inactivity and no samples due to the task having ended, Arjan asked for
    PERF_EVENT_EXIT events. This is useful to the boot delay
    instrumentation (bootchart) app.

    This patch changes the PERF_EVENT_FORK to be emitted on every
    clone, and adds PERF_EVENT_EXIT to be emitted on task exit,
    after the task's counters have been closed.

    This task tracing is controlled through: attr.comm || attr.mmap
    and through the new attr.task field.

    Suggested-by: Arjan van de Ven
    Cc: Paul Mackerras
    Cc: Anton Blanchard
    Signed-off-by: Peter Zijlstra
    [ cleaned up perf_counter.h a bit ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

30 Jul, 2009

1 commit

  • Fix a post-2.6.31 regression which was introduced by
    2ff05b2b4eac2e63d345fc731ea151a060247f53 ("oom: move oom_adj value from
    task_struct to mm_struct").

    After moving the oom_adj value from the task struct to the mm_struct, the
    oom_adj value was no longer properly inherited by child processes.

    Copying over the oom_adj value at fork time fixes that bug.

    [kosaki.motohiro@jp.fujitsu.com: test for current->mm before dereferencing it]
    Signed-off-by: Rik van Riel
    Reported-by: Paul Menage
    Cc: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

23 Jul, 2009

1 commit

  • …nel/git/peterz/linux-2.6-perf

    * 'perf-counters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf: (31 commits)
    perf_counter tools: Give perf top inherit option
    perf_counter tools: Fix vmlinux symbol generation breakage
    perf_counter: Detect debugfs location
    perf_counter: Add tracepoint support to perf list, perf stat
    perf symbol: C++ demangling
    perf: avoid structure size confusion by using a fixed size
    perf_counter: Fix throttle/unthrottle event logging
    perf_counter: Improve perf stat and perf record option parsing
    perf_counter: PERF_SAMPLE_ID and inherited counters
    perf_counter: Plug more stack leaks
    perf: Fix stack data leak
    perf_counter: Remove unused variables
    perf_counter: Make call graph option consistent
    perf_counter: Add perf record option to log addresses
    perf_counter: Log vfork as a fork event
    perf_counter: Synthesize VDSO mmap event
    perf_counter: Make sure we dont leak kernel memory to userspace
    perf_counter tools: Fix index boundary check
    perf_counter: Fix the tracepoint channel to perfcounters
    perf_counter, x86: Extend perf_counter Pentium M support
    ...

    Linus Torvalds