07 Apr, 2010

1 commit

  • - We weren't zeroing p->rss_stat[] at fork()

    - Consequently sync_mm_rss() was dereferencing tsk->mm for kernel
    threads and was oopsing.

    - Make __sync_task_rss_stat() static, too.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=15648

    [akpm@linux-foundation.org: remove the BUG_ON(!mm->rss)]
    Reported-by: Troels Liebe Bentsen
    Signed-off-by: KAMEZAWA Hiroyuki
    "Michael S. Tsirkin"
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

14 Mar, 2010

1 commit

  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    locking: Make sparse work with inline spinlocks and rwlocks
    x86/mce: Fix RCU lockdep splats
    rcu: Increase RCU CPU stall timeouts if PROVE_RCU
    ftrace: Replace read_barrier_depends() with rcu_dereference_raw()
    rcu: Suppress RCU lockdep warnings during early boot
    rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare()
    rcu: Suppress __mpol_dup() false positive from RCU lockdep
    rcu: Make rcu_read_lock_sched_held() handle !PREEMPT
    rcu: Add control variables to lockdep_rcu_dereference() diagnostics
    rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU
    rcu: Use wrapper function instead of exporting tasklist_lock
    sched, rcu: Fix rcu_dereference() for RCU-lockdep
    rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use
    rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU
    x86/gart: Unexport gart_iommu_aperture

    Fix trivial conflicts in kernel/trace/ftrace.c

    Linus Torvalds
     

13 Mar, 2010

2 commits


07 Mar, 2010

3 commits

  • Make sure compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
    add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
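
The double-fetch hazard described in this commit can be sketched in user space. This is a minimal illustration, not the kernel's rlimit helper: `struct limits`, `READ_LIMIT_ONCE` and `size_within_limit` are hypothetical names, and the macro is only modeled on the volatile-cast idea behind ACCESS_ONCE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a limit that another thread may rewrite. */
struct limits {
    unsigned long cur;
};

/* Volatile read in the spirit of ACCESS_ONCE(): the compiler has to
 * perform this access as written and may not silently re-fetch it. */
#define READ_LIMIT_ONCE(x) (*(volatile unsigned long *)&(x))

/* Returns 1 if `size` fits under the limit, 0 otherwise. The limit is
 * snapshotted once so both checks below see the same value. */
static int size_within_limit(struct limits *lim, unsigned long size)
{
    unsigned long snapshot;

    assert(lim != NULL);
    snapshot = READ_LIMIT_ONCE(lim->cur);

    if (snapshot == ~0UL)   /* assumption: all-ones means "unlimited" */
        return 1;
    return size <= snapshot;
}
```

Without the snapshot, the "unlimited" test and the comparison could each load `lim->cur` separately and observe two different values if the limit is rewritten in between.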
     
  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach in the tens of
    thousands. Real workloads are still a factor 10 less process intensive
    than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
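
The "O(N) to 2" figure in this commit follows from a simple cost model, sketched here as arithmetic. The assumption (taken from the description, not from the kernel code) is that a fraction 1/N of the pages still needs an N-step walk while every other page needs a single step; `avg_scan_cost` is a hypothetical helper.

```c
#include <assert.h>

/* Average rmap walk length per page under the cost model above:
 * a fraction 1/n of the pages costs n steps, the rest cost 1 step each. */
static double avg_scan_cost(unsigned int n)
{
    double shared_fraction;

    assert(n > 0);
    shared_fraction = 1.0 / n;
    return shared_fraction * n + (1.0 - shared_fraction) * 1.0;
}
```

The result is 2 - 1/N, which stays below 2 however large N grows; under the old scheme the same model gives N steps for every page.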
     
  • Presently, the per-mm statistics counters are defined by macros in sched.h.

    This patch modifies them to
    - be defined in mm.h as inline functions
    - use an array instead of macro-generated names.

    This patch reduces the size of a future patch that modifies the
    implementation of the per-mm counters.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
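
A rough sketch of the array-plus-inline-accessor layout this commit describes. All names here (`COUNTER_*`, `struct mm_counters`, `counter_*`) are hypothetical and chosen for illustration; they are not the identifiers used in mm.h.

```c
#include <assert.h>

/* Hypothetical counter indices; one array slot per statistic. */
enum {
    COUNTER_FILE_PAGES,
    COUNTER_ANON_PAGES,
    NR_COUNTERS
};

struct mm_counters {
    long count[NR_COUNTERS];
};

/* Accessors take the index as a parameter instead of pasting a field
 * name together with a macro. */
static inline long counter_get(const struct mm_counters *mm, int member)
{
    assert(member >= 0 && member < NR_COUNTERS);
    return mm->count[member];
}

static inline void counter_add(struct mm_counters *mm, int member, long value)
{
    assert(member >= 0 && member < NR_COUNTERS);
    mm->count[member] += value;
}

static inline void counter_inc(struct mm_counters *mm, int member)
{
    counter_add(mm, member, 1);
}

static inline void counter_dec(struct mm_counters *mm, int member)
{
    counter_add(mm, member, -1);
}
```

With this shape, changing how a counter is stored only touches the accessor bodies, which is the patch-size reduction the commit is after.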
     

04 Mar, 2010

1 commit

  • Lockdep-RCU commit d11c563d exported tasklist_lock, which is not
    a good thing. This patch instead exports a function that uses
    lockdep to check whether tasklist_lock is held.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    Cc: Christoph Hellwig
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

25 Feb, 2010

1 commit

  • Update the rcu_dereference() usages to take advantage of the new
    lockdep-based checking.

    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    LKML-Reference:
    [ -v2: fix allmodconfig missing symbol export build failure on x86 ]
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

22 Jan, 2010

1 commit

  • There are a number of issues:

    1) TASK_WAKING vs cgroup_clone (cpusets)

    copy_process():

      sched_fork()
        child->state = TASK_WAKING; /* waiting for wake_up_new_task() */
      if (current->nsproxy != p->nsproxy)
        ns_cgroup_clone()
          cgroup_clone()
            mutex_lock(inode->i_mutex)
            mutex_lock(cgroup_mutex)
            cgroup_attach_task()
              ss->can_attach()
              ss->attach() [ -> cpuset_attach() ]
                cpuset_attach_task()
                  set_cpus_allowed_ptr();
                    while (child->state == TASK_WAKING)
                      cpu_relax();
    will deadlock the system.

    2) cgroup_clone (cpusets) vs copy_process

    So even if the above would work we still have:

    copy_process():

      if (current->nsproxy != p->nsproxy)
        ns_cgroup_clone()
          cgroup_clone()
            mutex_lock(inode->i_mutex)
            mutex_lock(cgroup_mutex)
            cgroup_attach_task()
              ss->can_attach()
              ss->attach() [ -> cpuset_attach() ]
                cpuset_attach_task()
                  set_cpus_allowed_ptr();
      ...

      p->cpus_allowed = current->cpus_allowed

    over-writing the modified cpus_allowed.

    3) fork() vs hotplug

    if we unplug the child's cpu after the sanity check when the child
    gets attached to the task_list but before wake_up_new_task() shit
    will meet with fan.

    Solve all these issues by moving fork cpu selection into
    wake_up_new_task().

    Reported-by: Serge E. Hallyn
    Tested-by: Serge E. Hallyn
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

18 Dec, 2009

1 commit

  • Thanks to Roland who pointed out de_thread() issues.

    Currently we add sub-threads to ->real_parent->children list. This buys
    nothing but slows down do_wait().

    With this patch ->children contains only main threads (group leaders).
    The only complication is that forget_original_parent() should iterate over
    sub-threads by hand, and de_thread() needs another list_replace() when it
    changes ->group_leader.

    Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
    tasks, we can remove this check (and we can unify do_wait_thread() and
    ptrace_do_wait()).

    This change can confuse the optimistic search in mm_update_next_owner(),
    but this is fixable and minor.

    Perhaps badness() and oom_kill_process() should be updated, but they
    should be fixed in any case.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

16 Dec, 2009

2 commits

  • If the tracee calls fork() after PTRACE_SINGLESTEP, the forked child
    starts with TIF_SINGLESTEP/X86_EFLAGS_TF bits copied from ptraced parent.
    This is not right, especially when the new child is not auto-attached: in
    this case it is killed by SIGTRAP.

    Change copy_process() to call user_disable_single_step(). Tested on x86.

    Test-case:

    #include <stdio.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>
    #include <assert.h>

    int main(void)
    {
        int pid, status;

        if (!(pid = fork())) {
            assert(ptrace(PTRACE_TRACEME) == 0);
            kill(getpid(), SIGSTOP);

            if (!fork()) {
                /* kernel bug: this child will be killed by SIGTRAP */
                printf("Hello world\n");
                return 43;
            }

            wait(&status);
            return WEXITSTATUS(status);
        }

        for (;;) {
            assert(pid == wait(&status));
            if (WIFEXITED(status))
                break;
            assert(ptrace(PTRACE_SINGLESTEP, pid, 0,0) == 0);
        }

        assert(WEXITSTATUS(status) == 43);
        return 0;
    }

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • In a massively parallel environment, res_counter can be a performance
    bottleneck. One strong technique to reduce lock contention is reducing
    calls by coalescing some amount of calls into one.

    Considering charge/uncharge characteristics,
    - charge is done one by one via demand-paging.
    - uncharge is done
      - in chunks at munmap, truncate, exit, execve...
      - one by one via vmscan/paging.

    It seems we have a chance to coalesce uncharges for improving scalability
    at unmap/truncation.

    This patch is for coalescing uncharges. To avoid scattering memcg's
    structure to functions under /mm, this patch adds memcg batch uncharge
    information to the task. The reason for per-task batching is to make use
    of the caller's context information. We do batched uncharge (delayed
    uncharge) when truncation/unmap occurs but do direct uncharge when
    uncharge is called by memory reclaim (vmscan.c).

    The degree of coalescing depends on callers
    - at invalidate/truncate... pagevec size
    - at unmap ....ZAP_BLOCK_SIZE
    (memory itself will be freed in this degree.)
    Then, we'll not coalesce too much.

    On an x86-64 8cpu server, I tested the overhead of memcg at page fault by
    running a program which does map/fault/unmap in a loop. I ran one task
    per cpu (pinned by taskset) and summed the number of page faults
    over 60secs.

    [without memcg config]
    40156968 page-faults # 0.085 M/sec ( +- 0.046% )
    27.67 cache-miss/faults
    [root cgroup]
    36659599 page-faults # 0.077 M/sec ( +- 0.247% )
    31.58 miss/faults
    [in a child cgroup]
    18444157 page-faults # 0.039 M/sec ( +- 0.133% )
    69.96 miss/faults
    [child with this patch]
    27133719 page-faults # 0.057 M/sec ( +- 0.155% )
    47.16 miss/faults

    We can see some amount of improvement.
    (the root cgroup isn't affected by this patch)
    Another patch for "charge" will follow this one, and the above will improve further.

    Changelog(since 2009/10/02):
    - renamed fields of memcg_batch (as pages to bytes, memsw to memsw_bytes)
    - some clean up and commentary/description updates.
    - added initialize code to copy_process(). (possible bug fix)

    Changelog(old):
    - fixed !CONFIG_MEM_CGROUP case.
    - rebased onto the latest mmotm + softlimit fix patches.
    - unified patch for callers
    - added comments.
    - make ->do_batch as bool.
    - removed css_get() et al. We don't need it.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
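
The coalescing idea in this commit can be sketched as a single-threaded simulation. Everything named here (`struct shared_counter`, `struct uncharge_batch`, `batch_start`, `uncharge`, `batch_end`) is hypothetical; the `lock_taken` field merely counts how often a real implementation would have had to take the counter's lock.

```c
#include <assert.h>

/* Hypothetical shared counter; lock_taken counts simulated lock
 * acquisitions around updates of usage. */
struct shared_counter {
    unsigned long usage;
    unsigned long lock_taken;
};

/* Hypothetical per-task batch state. */
struct uncharge_batch {
    int do_batch;           /* nonzero while inside a batch section */
    unsigned long bytes;    /* uncharges accumulated so far */
};

static void counter_uncharge(struct shared_counter *c, unsigned long bytes)
{
    assert(c->usage >= bytes);
    c->lock_taken++;        /* stands in for lock/unlock around the update */
    c->usage -= bytes;
}

static void batch_start(struct uncharge_batch *b)
{
    b->do_batch = 1;
    b->bytes = 0;
}

/* Defer the uncharge while batching (unmap/truncate style callers),
 * otherwise hit the shared counter directly (reclaim style callers). */
static void uncharge(struct shared_counter *c, struct uncharge_batch *b,
                     unsigned long bytes)
{
    if (b->do_batch)
        b->bytes += bytes;
    else
        counter_uncharge(c, bytes);
}

/* Flush everything accumulated in one shared-counter update. */
static void batch_end(struct shared_counter *c, struct uncharge_batch *b)
{
    b->do_batch = 0;
    if (b->bytes)
        counter_uncharge(c, b->bytes);
    b->bytes = 0;
}
```

Eight page-sized uncharges between `batch_start()` and `batch_end()` cost one simulated lock acquisition instead of eight, while an uncharge outside a batch section still goes straight to the counter.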
     

15 Dec, 2009

1 commit


09 Dec, 2009

2 commits

  • * 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: (113 commits)
    cfq-iosched: Do not access cfqq after freeing it
    block: include linux/err.h to use ERR_PTR
    cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit
    blkio: Allow CFQ group IO scheduling even when CFQ is a module
    blkio: Implement dynamic io controlling policy registration
    blkio: Export some symbols from blkio as its user CFQ can be a module
    block: Fix io_context leak after failure of clone with CLONE_IO
    block: Fix io_context leak after clone with CLONE_IO
    cfq-iosched: make nonrot check logic consistent
    io controller: quick fix for blk-cgroup and modular CFQ
    cfq-iosched: move IO controller declerations to a header file
    cfq-iosched: fix compile problem with !CONFIG_CGROUP
    blkio: Documentation
    blkio: Wait on sync-noidle queue even if rq_noidle = 1
    blkio: Implement group_isolation tunable
    blkio: Determine async workload length based on total number of queues
    blkio: Wait for cfq queue to get backlogged if group is empty
    blkio: Propagate cgroup weight updation to cfq groups
    blkio: Drop the reference to queue once the task changes cgroup
    blkio: Provide some isolation between groups
    ...

    Linus Torvalds
     
  • * 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (84 commits)
    KVM: VMX: Fix comparison of guest efer with stale host value
    KVM: s390: Fix prefix register checking in arch/s390/kvm/sigp.c
    KVM: Drop user return notifier when disabling virtualization on a cpu
    KVM: VMX: Disable unrestricted guest when EPT disabled
    KVM: x86 emulator: limit instructions to 15 bytes
    KVM: s390: Make psw available on all exits, not just a subset
    KVM: x86: Add KVM_GET/SET_VCPU_EVENTS
    KVM: VMX: Report unexpected simultaneous exceptions as internal errors
    KVM: Allow internal errors reported to userspace to carry extra data
    KVM: Reorder IOCTLs in main kvm.h
    KVM: x86: Polish exception injection via KVM_SET_GUEST_DEBUG
    KVM: only clear irq_source_id if irqchip is present
    KVM: x86: disallow KVM_{SET,GET}_LAPIC without allocated in-kernel lapic
    KVM: x86: disallow multiple KVM_CREATE_IRQCHIP
    KVM: VMX: Remove vmx->msr_offset_efer
    KVM: MMU: update invlpg handler comment
    KVM: VMX: move CR3/PDPTR update to vmx_set_cr3
    KVM: remove duplicated task_switch check
    KVM: powerpc: Fix BUILD_BUG_ON condition
    KVM: VMX: Use shared msr infrastructure
    ...

    Trivial conflicts due to new Kconfig options in arch/Kconfig and kernel/Makefile

    Linus Torvalds
     

04 Dec, 2009

1 commit

  • With CLONE_IO, parent's io_context->nr_tasks is incremented, but never
    decremented whenever copy_process() fails afterwards, which prevents
    exit_io_context() from calling IO schedulers exit functions.

    Give a task_struct to exit_io_context(), and call exit_io_context() instead of
    put_io_context() in copy_process() cleanup path.

    Signed-off-by: Louis Rilling
    Signed-off-by: Jens Axboe

    Louis Rilling
     

03 Dec, 2009

3 commits

  • Signed-off-by: Avi Kivity

    Avi Kivity
     
  • This is a real fix for problem of utime/stime values decreasing
    described in the thread:

    http://lkml.org/lkml/2009/11/3/522

    Now cputime is accounted in the following way:

    - {u,s}time in task_struct are increased every time when the thread
    is interrupted by a tick (timer interrupt).

    - When a thread exits, its {u,s}time are added to signal->{u,s}time,
    after adjusted by task_times().

    - When all threads in a thread_group exit, accumulated {u,s}time
    (and also c{u,s}time) in signal struct are added to c{u,s}time
    in signal struct of the group's parent.

    So {u,s}time in task struct are "raw" tick count, while
    {u,s}time and c{u,s}time in signal struct are "adjusted" values.

    And accounted values are used by:

    - task_times(), to get cputime of a thread:
    This function returns adjusted values that originate from raw
    {u,s}time, scaled by the sum_exec_runtime accounted by CFS.

    - thread_group_cputime(), to get cputime of a thread group:
    This function returns the sum of all {u,s}time of living threads in
    the group, plus {u,s}time in the signal struct, which is the sum of
    adjusted cputimes of all exited threads that belonged to the group.

    The problem is the return value of thread_group_cputime(),
    because it is a mixed sum of "raw" values and "adjusted" values:

    group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

    This misbehavior can break {u,s}time monotonicity.
    Assume there is a thread whose raw values are greater than its
    adjusted values (e.g. interrupted by 1000Hz ticks 50 times
    but only ran for 45ms). When it exits, the group's cputime will
    decrease (e.g. by 5ms).

    To fix this, we could do:

    group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

    But task_times() contains hard divisions, so applying it for
    every thread should be avoided.

    This patch fixes the above problem in the following way:

    - Modify thread's exit (= __exit_signal()) not to use task_times().
    It means {u,s}time in signal struct accumulates raw values instead
    of adjusted values. As a result, thread_group_cputime()
    returns a pure sum of "raw" values.

    - Introduce a new function thread_group_times(*task, *utime, *stime)
    that converts "raw" values of thread_group_cputime() to "adjusted"
    values, using the same calculation procedure as task_times().

    - Modify group's exit (= wait_task_zombie()) to use the newly introduced
    thread_group_times(). It makes c{u,s}time in signal struct
    hold adjusted values, as before this patch.

    - Replace some thread_group_cputime() calls by thread_group_times().
    These replacements are only applied where "adjusted" cputime is
    conveyed to users, and where task_times() is already used nearby
    (i.e. sys_times(), getrusage(), and /proc/<pid>/stat).

    This patch has a positive side effect:

    - Before this patch, if a group contained many short-lived threads
    (e.g. each runs 0.9ms and is not interrupted by ticks), the group's
    cputime could be invisible, since each thread's cputime was accumulated
    after being adjusted: imagine the adjustment function as adj(ticks, runtime),
    {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
    After this patch this will not happen, because the adjustment is
    applied after accumulation.

    v2:
    - remove if()s, put new variables into signal_struct.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Spencer Candland
    Cc: Americo Wang
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Stanislaw Gruszka
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
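
The two numeric examples in this commit (50 ticks against 45ms of runtime, and threads that run 0.9ms without being hit by a tick) can be reproduced with a simplified model of the adjustment. This is a sketch under stated assumptions, not the kernel's task_times(): `adjust_utime` is a hypothetical helper, ticks are assumed to be 1ms each, and runtime is truncated to whole ticks before scaling.

```c
#define USEC_PER_TICK 1000UL    /* assumption: 1000Hz tick */

/* Scale the measured runtime by the share of ticks seen in user mode.
 * utime_ticks and stime_ticks are raw tick counts, runtime_us is the
 * scheduler-accounted runtime in microseconds. */
static unsigned long adjust_utime(unsigned long utime_ticks,
                                  unsigned long stime_ticks,
                                  unsigned long runtime_us)
{
    unsigned long total = utime_ticks + stime_ticks;
    unsigned long rtime = runtime_us / USEC_PER_TICK;   /* truncates */

    if (total == 0)
        return rtime;   /* no ticks seen: attribute all runtime to utime */
    return (unsigned long)((unsigned long long)rtime * utime_ticks / total);
}
```

In this model a thread with 50 raw user ticks and 45000us of runtime adjusts to 45, so a group total that switches from the raw 50 to the adjusted 45 when the thread exits drops by 5. Likewise `adjust_utime(0, 0, 900)` is 0 for each short-lived thread, whereas adjusting the accumulated runtime of ten such threads, `adjust_utime(0, 0, 9000)`, gives 9.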
     
  • - Remove if({u,s}t)s because no one calls it with NULL now.
    - Use cputime_{add,sub}().
    - Add ifndef-endif for prev_{u,s}time since they are used
    only when !VIRT_CPU_ACCOUNTING.

    Signed-off-by: Hidetoshi Seto
    Cc: Peter Zijlstra
    Cc: Spencer Candland
    Cc: Americo Wang
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Stanislaw Gruszka
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

30 Nov, 2009

1 commit

  • fork() clones all thread_info flags, including
    TIF_USER_RETURN_NOTIFY; if the new task is first scheduled on a cpu
    which doesn't have user return notifiers set, this causes user
    return notifiers to trigger without any way of clearing itself.

    This is easy to trigger with a forky workload on the host in
    parallel with kvm, resulting in a cpu in an endless loop on the
    verge of returning to userspace.

    Fix by dropping the TIF_USER_RETURN_NOTIFY immediately after fork.

    Signed-off-by: Avi Kivity
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Avi Kivity
     

03 Nov, 2009

1 commit

  • nr_processes() returns the sum of the per cpu counter process_counts for
    all online CPUs. This counter is incremented for the current CPU on
    fork() and decremented for the current CPU on exit(). Since a process
    does not necessarily fork and exit on the same CPU the process_count for
    an individual CPU can be either positive or negative and effectively has
    no meaning in isolation.

    Therefore calculating the sum of process_counts over only the online
    CPUs omits the processes which were started or stopped on any CPU which
    has since been unplugged. Only the sum of process_counts across all
    possible CPUs has meaning.

    The only caller of nr_processes() is proc_root_getattr() which
    calculates the number of links to /proc as
    stat->nlink = proc_root.nlink + nr_processes();

    You don't have to be all that unlucky for the nr_processes() to return a
    negative value leading to a negative number of links (or rather, an
    apparently enormous number of links). If this happens then you can get
    failures where things like "ls /proc" start to fail because they got an
    -EOVERFLOW from some stat() call.

    Example with some debugging inserted to show what goes on:
    # ps haux|wc -l
    nr_processes: CPU0: 90
    nr_processes: CPU1: 1030
    nr_processes: CPU2: -900
    nr_processes: CPU3: -136
    nr_processes: TOTAL: 84
    proc_root_getattr. nlink 12 + nr_processes() 84 = 96
    84
    # echo 0 >/sys/devices/system/cpu/cpu1/online
    # ps haux|wc -l
    nr_processes: CPU0: 85
    nr_processes: CPU2: -901
    nr_processes: CPU3: -137
    nr_processes: TOTAL: -953
    proc_root_getattr. nlink 12 + nr_processes() -953 = -941
    75
    # stat /proc/
    nr_processes: CPU0: 84
    nr_processes: CPU2: -901
    nr_processes: CPU3: -137
    nr_processes: TOTAL: -954
    proc_root_getattr. nlink 12 + nr_processes() -954 = -942
    File: `/proc/'
    Size: 0 Blocks: 0 IO Block: 1024 directory
    Device: 3h/3d Inode: 1 Links: 4294966354
    Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
    Access: 2009-11-03 09:06:55.000000000 +0000
    Modify: 2009-11-03 09:06:55.000000000 +0000
    Change: 2009-11-03 09:06:55.000000000 +0000

    I'm not 100% convinced that the per_cpu regions remain valid for offline
    CPUs, although my testing suggests that they do. If not then I think the
    correct solution would be to aggregate the process_count for a given CPU
    into a global base value in cpu_down().

    This bug appears to pre-date the transition to git and it looks like it
    may even have been present in linux-2.6.0-test7-bk3 since it looks like
    the code Rusty patched in http://lwn.net/Articles/64773/ was already
    wrong.

    Signed-off-by: Ian Campbell
    Cc: Andrew Morton
    Cc: Rusty Russell
    Signed-off-by: Linus Torvalds

    Ian Campbell
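
The arithmetic behind the debugging output in this commit can be sketched as follows. The helper names are hypothetical, the per-CPU values are the ones printed in the last snapshot, and CPU1's count of 1030 is assumed to be unchanged since it went offline. A 32-bit `unsigned int` is assumed for the link count.

```c
#include <assert.h>

/* Sum per-CPU process counts. With online_only set, offline CPUs are
 * skipped, which is the behaviour this commit identifies as wrong. */
static int sum_process_counts(const int *counts, const int *online,
                              int ncpus, int online_only)
{
    int cpu, total = 0;

    assert(ncpus > 0);
    for (cpu = 0; cpu < ncpus; cpu++) {
        if (online_only && !online[cpu])
            continue;
        total += counts[cpu];
    }
    return total;
}

/* The link count is reported as an unsigned value, so a negative sum
 * wraps around to an enormous number. */
static unsigned int root_nlink(int base, int nr_processes)
{
    return (unsigned int)(base + nr_processes);
}
```

With counts {84, 1030, -901, -137} and CPU1 offline, the online-only sum is -954 and 12 + (-954) wraps to 4294966354, matching the `Links:` field shown by stat. Summing over all four CPUs gives 76 under the stated assumption, hence 88 links.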
     

09 Oct, 2009

1 commit

  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    futex: fix requeue_pi key imbalance
    futex: Fix typo in FUTEX_WAIT/WAKE_BITSET_PRIVATE definitions
    rcu: Place root rcu_node structure in separate lockdep class
    rcu: Make hot-unplugged CPU relinquish its own RCU callbacks
    rcu: Move rcu_barrier() to rcutree
    futex: Move exit_pi_state() call to release_mm()
    futex: Nullify robust lists after cleanup
    futex: Fix locking imbalance
    panic: Fix panic message visibility by calling bust_spinlocks(0) before dying
    rcu: Replace the rcu_barrier enum with pointer to call_rcu*() function
    rcu: Clean up code based on review feedback from Josh Triplett, part 4
    rcu: Clean up code based on review feedback from Josh Triplett, part 3
    rcu: Fix rcu_lock_map build failure on CONFIG_PROVE_LOCKING=y
    rcu: Clean up code to address Ingo's checkpatch feedback
    rcu: Clean up code based on review feedback from Josh Triplett, part 2
    rcu: Clean up code based on review feedback from Josh Triplett

    Linus Torvalds
     

06 Oct, 2009

2 commits

  • exit_pi_state() is called from do_exit() but not from do_execve().
    Move it to release_mm() so it gets called from do_execve() as well.

    Signed-off-by: Thomas Gleixner
    LKML-Reference:
    Cc: stable@kernel.org
    Cc: Anirban Sinha
    Cc: Peter Zijlstra

    Thomas Gleixner
     
  • The robust list pointers of user space held futexes are kept intact
    over an exec() call. When the exec'ed task exits exit_robust_list() is
    called with the stale pointer. The risk of corruption is minimal, but
    still it is incorrect to keep the pointers valid. Actually glibc
    should uninstall the robust list before calling exec() but we have to
    deal with it anyway.

    Nullify the pointers after [compat_]exit_robust_list() has been
    called.

    Reported-by: Anirban Sinha
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Thomas Gleixner
    LKML-Reference:
    Cc: stable@kernel.org

    Peter Zijlstra
     

24 Sep, 2009

4 commits

  • Because the binfmt does not differ between threads in the same process,
    it can be moved from task_struct to mm_struct. The binfmt module is then
    handled per mm_struct instead of per task_struct.

    Signed-off-by: Hiroshi Shimamoto
    Acked-by: Oleg Nesterov
    Cc: Rusty Russell
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hiroshi Shimamoto
     
  • ->ioctx_lock and ->ioctx_list are used only under CONFIG_AIO.

    Signed-off-by: Alexey Dobriyan
    Cc: Zach Brown
    Cc: Benjamin LaHaise
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • When global or container-init processes use CLONE_PARENT, they create a
    multi-rooted process tree. Besides, siblings of global init remain as
    zombies on exit since they are not reaped by their parent (swapper). So
    prevent global and container-inits from creating siblings.

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Eric W. Biederman
    Acked-by: Roland McGrath
    Cc: Oren Laadan
    Cc: Oleg Nesterov
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    itimers: Add tracepoints for itimer
    hrtimer: Add tracepoint for hrtimers
    timers: Add tracepoints for timer_list timers
    cputime: Optimize jiffies_to_cputime(1)
    itimers: Simplify arm_timer() code a bit
    itimers: Fix periodic tics precision
    itimers: Merge ITIMER_VIRT and ITIMER_PROF

    Trivial header file include conflicts in kernel/fork.c

    Linus Torvalds
     

23 Sep, 2009

2 commits

  • A patch to give a better overview of the userland application stack usage,
    especially for embedded Linux.

    Currently you are only able to dump the main process/thread stack usage,
    which is shown in /proc/pid/status by the "VmStk" value. But you get no
    information about the consumed stack memory of the threads.

    There is an enhancement in /proc/<pid>/{task/*,}/*maps which marks
    the vm mapping where the thread stack pointer resides with "[thread stack
    xxxxxxxx]". xxxxxxxx is the maximum size of the stack. This is valuable
    information, because libpthread doesn't set the start of the stack to the
    top of the mapped area, depending on the pthread usage.

    A sample output of /proc/<pid>/task/<tid>/maps looks like:

    08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
    08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
    0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
    a7d12000-a7d13000 ---p 00000000 00:00 0
    a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
    a7f13000-a7f14000 ---p 00000000 00:00 0
    a7f14000-a7f36000 rw-p 00000000 00:00 0
    a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
    a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
    a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
    a806c000-a806f000 rw-p 00000000 00:00 0
    a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
    a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
    a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
    a8085000-a8088000 rw-p 00000000 00:00 0
    a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
    a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
    a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
    afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
    ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

    Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
    which will give you the current stack usage in kB.

    A sample output of /proc/self/status looks like:

    Name: cat
    State: R (running)
    Tgid: 507
    Pid: 507
    .
    .
    .
    CapBnd: fffffffffffffeff
    voluntary_ctxt_switches: 0
    nonvoluntary_ctxt_switches: 0
    Stack usage: 12 kB

    I also fixed the stack base address in /proc/<pid>/{task/*,}/stat to be the
    base address of the associated thread stack and not the one of the main
    process. This makes more sense.

    [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
    Signed-off-by: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     
  • Make ->ru_maxrss value in struct rusage filled accordingly to rss hiwater
    mark. This struct is filled as a parameter to getrusage syscall.
    ->ru_maxrss value is set to KBs which is the way it is done in BSD
    systems. /usr/bin/time (gnu time) application converts ->ru_maxrss to KBs
    which seems to be incorrect behavior. Maintainer of this util was
    notified by me with the patch which corrects it and cc'ed.

    To make this happen we extend struct signal_struct by two fields. The
    first one is ->maxrss, which we use to store the rss hiwater of the task.
    The second one is ->cmaxrss, which we use to store the highest rss hiwater
    of all the task's children. These values are used in k_getrusage() to
    actually fill ->ru_maxrss. k_getrusage() uses the current rss hiwater
    value directly if the mm struct exists.

    Note:
    exec() clears mm->hiwater_rss, but doesn't clear sig->maxrss.
    This is intentional behavior: *BSD getrusage also inherits across exec().

    test programs
    ========================================================

    getrusage.c
    ===========
    #include <stdio.h>
    #include <stdlib.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    #include "common.h"

    #define err(str) perror(str), exit(1)

    int main(int argc, char** argv)
    {
        int status;

        printf("allocate 100MB\n");
        consume(100);

        printf("testcase1: fork inherit? \n");
        printf(" expect: initial.self ~= child.self\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
        } else {
            show_rusage("fork child");
            _exit(0);
        }
        printf("\n");

        printf("testcase2: fork inherit? (cont.) \n");
        printf(" expect: initial.children ~= 100MB, but child.children = 0\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
        } else {
            show_rusage("child");
            _exit(0);
        }
        printf("\n");

        printf("testcase3: fork + malloc \n");
        printf(" expect: child.self ~= initial.self + 50MB\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
        } else {
            printf("allocate +50MB\n");
            consume(50);
            show_rusage("fork child");
            _exit(0);
        }
        printf("\n");

        printf("testcase4: grandchild maxrss\n");
        printf(" expect: post_wait.children ~= 300MB\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
            show_rusage("post_wait");
        } else {
            system("./child -n 0 -g 300");
            _exit(0);
        }
        printf("\n");

        printf("testcase5: zombie\n");
        printf(" expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
        printf(" post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
        show_rusage("initial");
        if (__fork()) {
            sleep(1); /* children become zombie */
            show_rusage("pre_wait");
            wait(&status);
            show_rusage("post_wait");
        } else {
            system("./child -n 400");
            _exit(0);
        }
        printf("\n");

        printf("testcase6: SIG_IGN\n");
        printf(" expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
        show_rusage("initial");
        signal(SIGCHLD, SIG_IGN);
        if (__fork()) {
            sleep(1); /* children become zombie */
            show_rusage("after_zombie");
        } else {
            system("./child -n 500");
            _exit(0);
        }
        printf("\n");
        signal(SIGCHLD, SIG_DFL);

        printf("testcase7: exec (without fork) \n");
        printf(" expect: initial ~= exec \n");
        show_rusage("initial");
        execl("./child", "child", "-v", NULL);

        return 0;
    }

    child.c
    =======
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #include "common.h"

    int main(int argc, char** argv)
    {
        int status;
        int c;
        long consume_size = 0;
        long grandchild_consume_size = 0;
        int show = 0;

        while ((c = getopt(argc, argv, "n:g:v")) != -1) {
            switch (c) {
            case 'n':
                consume_size = atol(optarg);
                break;
            case 'v':
                show = 1;
                break;
            case 'g':
                grandchild_consume_size = atol(optarg);
                break;
            default:
                break;
            }
        }

        if (show)
            show_rusage("exec");

        if (consume_size) {
            printf("child alloc %ldMB\n", consume_size);
            consume(consume_size);
        }

        if (grandchild_consume_size) {
            if (fork()) {
                wait(&status);
            } else {
                printf("grandchild alloc %ldMB\n", grandchild_consume_size);
                consume(grandchild_consume_size);

                exit(0);
            }
        }

        return 0;
    }

    common.c
    ========
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    #include "common.h"
    #define err(str) perror(str), exit(1)

    void show_rusage(char *prefix)
    {
        int err, err2;
        struct rusage rusage_self;
        struct rusage rusage_children;

        printf("%s: ", prefix);
        err = getrusage(RUSAGE_SELF, &rusage_self);
        if (!err)
            printf("self %ld ", rusage_self.ru_maxrss);
        err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
        if (!err2)
            printf("children %ld ", rusage_children.ru_maxrss);

        printf("\n");
    }

    /* Some buggy OS need this worthless CPU waste. */
    void make_pagefault(void)
    {
        void *addr;
        int size = getpagesize();
        int i;

        for (i=0; i

    Signed-off-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Pirko
     

22 Sep, 2009

5 commits

  • Currently, the OOM logic call flow is as follows:

    __out_of_memory()
        select_bad_process()    for each task
            badness()           calculate badness of one task
        oom_kill_process()      search child
            oom_kill_task()     kill target task and mm shared tasks with it

    For example, process-A has two threads, thread-A and thread-B, and it
    uses a very large amount of memory. Each thread has the following
    oom_adj and oom_score:

    thread-A: oom_adj = OOM_DISABLE, oom_score = 0
    thread-B: oom_adj = 0, oom_score = very-high

    Then select_bad_process() selects thread-B, but oom_kill_task() refuses
    to kill the task because thread-A has OOM_DISABLE. Thus __out_of_memory()
    calls select_bad_process() again, but select_bad_process() selects the
    same task. This means the kernel falls into a livelock.

    The fact is, select_bad_process() must select a killable task; otherwise
    the OOM logic goes into a livelock.

    The root cause is that oom_adj shouldn't be a per-thread value. It should
    be a per-process value, because the OOM killer kills a process, not a
    thread. Thus this patch moves oomkilladj (now more appropriately named
    oom_adj) from struct task_struct to struct signal_struct. This naturally
    prevents select_bad_process() from choosing the wrong task.
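
A small user-space model may help show why a per-process oom_adj avoids the livelock. The structures and the select_bad() function below are hypothetical simplifications, not the kernel's code: every thread reads oom_adj through its shared process structure, so the selection step can never pick a thread that the kill step would then refuse.

```c
#include <assert.h>
#include <stddef.h>

#define OOM_DISABLE (-17)   /* value used by kernels of this era */

/* Per-process state: after the patch, oom_adj is stored here. */
struct proc_model {
    int oom_adj;
};

/* Per-thread state: each thread points at its process. */
struct thread_model {
    const struct proc_model *proc;
    unsigned long score;    /* stand-in for the badness() result */
};

/*
 * Pick the highest-scoring thread whose process is killable.
 * Returns NULL when nothing can be killed.
 */
const struct thread_model *select_bad(const struct thread_model *t, size_t n)
{
    const struct thread_model *chosen = NULL;
    size_t i;

    assert(t != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (t[i].proc->oom_adj == OOM_DISABLE)
            continue;   /* whole process is exempt, so skip every thread */
        if (!chosen || t[i].score > chosen->score)
            chosen = &t[i];
    }
    return chosen;
}
```

In the thread-A/thread-B scenario, both threads would share one proc_model with OOM_DISABLE, so neither is ever returned and the selection moves on to another process.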

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Rawhide users have reported hang at startup when cryptsetup is run: the
    same problem can be simply reproduced by running a program int main() {
    mlockall(MCL_CURRENT | MCL_FUTURE); return 0; }
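
Spelled out with its header and an error path, the mlockall() call from the reproducer might look like the sketch below. The wrapper name lock_all_memory() is hypothetical; the reproducer itself simply makes this call from main() and returns.

```c
#include <errno.h>
#include <sys/mman.h>

/*
 * Lock all current and future mappings, as the reproducer does.
 * Returns 0 on success or the errno value reported by mlockall()
 * (for example when the caller lacks the privilege or rlimit to lock).
 */
int lock_all_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == 0)
        return 0;
    return errno;
}
```

On an affected kernel the hang would be expected when the process exits after a successful call, since that is when exit_mmap() munlocks the VM_LOCKED areas.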

    The problem is that exit_mmap() applies munlock_vma_pages_all() to
    clean up VM_LOCKED areas, and its current implementation (stupidly)
    tries to fault in absent pages, for example where PROT_NONE prevented
    them being faulted in when mlocking. Whereas the "ksm: fix oom
    deadlock" patch, knowing there's a race by which KSM might try to fault
    in pages after exit_mmap() had finally zapped the range, backs out of
    such faults doing nothing when its ksm_test_exit() notices mm_users 0.

    So revert that part of "ksm: fix oom deadlock" which moved the
    ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
    and remove those ksm_test_exit() checks from the page fault paths, so
    allowing the munlocking to proceed without interference.

    ksm_exit, if there are rmap_items still chained on this mm slot, takes
    mmap_sem write side: so preventing KSM from working on an mm while
    exit_mmap runs. And KSM will bail out as soon as it notices that
    mm_users is already zero, thanks to its internal ksm_test_exit checks.
    So that when a task is killed by OOM killer or the user, KSM will not
    indefinitely prevent it from running exit_mmap to release its memory.

    This does break a part of what "ksm: fix oom deadlock" was trying to
    achieve. When unmerging KSM (echo 2 >/sys/kernel/mm/ksm/run), and even
    when ksmd itself has to cancel a KSM page, it is possible that the
    first OOM-kill victim would be the KSM process being faulted: then its
    memory won't be freed until a second victim has been selected (freeing
    memory for the unmerging fault to complete).

    But the OOM killer is already liable to kill a second victim once the
    intended victim's p->mm goes to NULL: so there's not much point in
    rejecting this KSM patch before fixing that OOM behaviour. It is very
    much more important to allow KSM users to boot up, than to haggle over
    an unlikely and poorly supported OOM case.

    We also intend to fix munlocking to not fault pages: at which point
    this patch _could_ be reverted; though that would be controversial, so
    we hope to find a better solution.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Justin M. Forbes
    Acked-for-now-by: Hugh Dickins
    Cc: Izik Eidus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • There's a now-obvious deadlock in KSM's out-of-memory handling:
    imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
    trying to allocate a page to break KSM in an mm which becomes the
    OOM victim (quite likely in the unmerge case): it's killed and goes
    to exit, and hangs there waiting to acquire ksm_thread_mutex.

    Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
    though that made everything else: perhaps use mmap_sem somehow?
    And part of the answer lies in the comments on unmerge_ksm_pages:
    __ksm_exit should also leave all the rmap_item removal to ksmd.

    But there's a fundamental problem, that KSM relies upon mmap_sem to
    guarantee the consistency of the mm it's dealing with, yet exit_mmap
    tears down an mm without taking mmap_sem. And bumping mm_users won't
    help at all, that just ensures that the pages the OOM killer assumes
    are on their way to being freed will not be freed.

    The best answer seems to be, to move the ksm_exit callout from just
    before exit_mmap, to the middle of exit_mmap: after the mm's pages
    have been freed (if the mmu_gather is flushed), but before its page
    tables and vma structures have been freed; and down_write,up_write
    mmap_sem there to serialize with KSM's own reliance on mmap_sem.

    But KSM then needs to be careful, whenever it downs mmap_sem, to
    check that the mm is not already exiting: there's a danger of using
    find_vma on a layout that's being torn apart, or writing into page
    tables which have been freed for reuse; and even do_anonymous_page
    and __do_fault need to check they're not being called by break_ksm
    to reinstate a pte after zap_pte_range has zapped that page table.

    Though it might be clearer to add an exiting flag, set while holding
    mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
    a zapped pte. All we need is to check whether mm_users is 0 - but
    must remember that ksmd may detect that before __ksm_exit is reached.
    So, ksm_test_exit(mm) added to comment such checks on mm->mm_users.

    __ksm_exit now has to leave clearing up the rmap_items to ksmd,
    that needs ksm_thread_mutex; but shift the exiting mm just after the
    ksm_scan cursor so that it will soon be dealt with. __ksm_enter raise
    mm_count to hold the mm_struct, ksmd's exit processing (exactly like
    its processing when it finds all VM_MERGEABLEs unmapped) mmdrop it,
    similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd).

    But also give __ksm_exit a fast path: when there's no complication
    (no rmap_items attached to mm and it's not at the ksm_scan cursor),
    it can safely do all the exiting work itself. This is not just an
    optimization: when ksmd is not running, the raised mm_count would
    otherwise leak mm_structs.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This patch presents the mm interface to a dummy version of ksm.c, for
    better scrutiny of that interface: the real ksm.c follows later.

    When CONFIG_KSM is not set, madvise(2) rejects MADV_MERGEABLE and
    MADV_UNMERGEABLE with EINVAL, since that seems more helpful than
    pretending that they can be serviced. But when CONFIG_KSM=y, accept them
    even if KSM is not currently running, and even on areas which KSM will not
    touch (e.g. hugetlb or shared file or special driver mappings).

    Like other madvices, report ENOMEM despite success if any area in the
    range is unmapped, and use EAGAIN to report out of memory.
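
From user space, the behaviour described above could be probed with a sketch like the following. The helper name try_mark_mergeable() is hypothetical; it assumes a libc that exposes MADV_MERGEABLE in <sys/mman.h>, and it returns the errno so a caller can tell "KSM not configured" (EINVAL) apart from success.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory and mark it mergeable.
 * Returns 0 on success, or the errno from mmap()/madvise().
 */
int try_mark_mergeable(size_t len)
{
    void *p;
    int rc = 0;

    assert(len > 0);

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return errno;

    if (madvise(p, len, MADV_MERGEABLE) != 0)
        rc = errno;     /* EINVAL is expected when CONFIG_KSM is not set */

    munmap(p, len);
    return rc;
}
```

A return of 0 only means the advice was accepted; as the message notes, that does not imply KSM is currently running or will touch the area.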

    Define vma flag VM_MERGEABLE to identify an area on which KSM may try
    merging pages: leave it to ksm_madvise() to decide whether to set it.
    Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain
    VM_MERGEABLE areas, to minimize callouts when forking or exiting.

    Based upon earlier patches by Chris Wright and Izik Eidus.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Chris Wright
    Signed-off-by: Izik Eidus
    Cc: Michael Kerrisk
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Avi Kivity
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The amount of memory allocated to kernel stacks can become significant and
    cause OOM conditions. However, we do not display the amount of memory
    consumed by stacks.

    Add code to display the amount of memory used for stacks in /proc/meminfo.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has grown out of its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring, analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in all, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names. (in an ABI compatible fashion)

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks go to Stephane Eranian who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility is not affected - this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

    for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\<event\>/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point in time where the amount of pending patches
    is the smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

12 Sep, 2009

1 commit

  • * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (28 commits)
    rcu: Move end of special early-boot RCU operation earlier
    rcu: Changes from reviews: avoid casts, fix/add warnings, improve comments
    rcu: Create rcutree plugins to handle hotplug CPU for multi-level trees
    rcu: Remove lockdep annotations from RCU's _notrace() API members
    rcu: Add #ifdef to suppress __rcu_offline_cpu() warning in !HOTPLUG_CPU builds
    rcu: Add CPU-offline processing for single-node configurations
    rcu: Add "notrace" to RCU function headers used by ftrace
    rcu: Remove CONFIG_PREEMPT_RCU
    rcu: Merge preemptable-RCU functionality into hierarchical RCU
    rcu: Simplify rcu_pending()/rcu_check_callbacks() API
    rcu: Use debugfs_remove_recursive() simplify code.
    rcu: Merge per-RCU-flavor initialization into pre-existing macro
    rcu: Fix online/offline indication for rcudata.csv trace file
    rcu: Consolidate sparse and lockdep declarations in include/linux/rcupdate.h
    rcu: Renamings to increase RCU clarity
    rcu: Move private definitions from include/linux/rcutree.h to kernel/rcutree.h
    rcu: Expunge lingering references to CONFIG_CLASSIC_RCU, optimize on !SMP
    rcu: Delay rcu_barrier() wait until beginning of next CPU-hotunplug operation.
    rcu: Fix typo in rcu_irq_exit() comment header
    rcu: Make rcupreempt_trace.c look at offline CPUs
    ...

    Linus Torvalds
     

11 Sep, 2009

1 commit


04 Sep, 2009

1 commit