16 Mar, 2010

1 commit

  • perf_arch_fetch_caller_regs() is exported for the overridden x86
    version, but not for the generic weak version.

    As a general rule, weak functions should not have their symbol
    exported in the same file in which they are defined.

    So let's export it in trace_event_perf.c, as it is used by trace
    events only.
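
    A minimal sketch of the resulting layout (signatures abbreviated,
    file placement as described above):

    /* kernel/perf_event.c: the generic weak definition, no export here */
    void __weak perf_arch_fetch_caller_regs(struct pt_regs *regs,
                                            unsigned long ip, int skip)
    {
    }

    /* kernel/trace/trace_event_perf.c: the user of the symbol exports
     * it, so modules get whichever version (weak generic or arch
     * override) was linked in */
    EXPORT_SYMBOL_GPL(perf_arch_fetch_caller_regs);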

    This fixes:

    ERROR: ".perf_arch_fetch_caller_regs" [fs/xfs/xfs.ko] undefined!
    ERROR: ".perf_arch_fetch_caller_regs" [arch/powerpc/platforms/cell/spufs/spufs.ko] undefined!

    -v2: And also only build it if trace events are enabled.
    -v3: Fix changelog mistake

    Reported-by: Stephen Rothwell
    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Paul Mackerras
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

12 Mar, 2010

1 commit


11 Mar, 2010

4 commits

  • This patch is an optimization in perf_event_task_sched_in() to avoid
    scheduling the events twice in a row.

    Without it, the perf_disable()/perf_enable() pair is invoked twice,
    so pinned events keep counting while the flexible events are being
    scheduled, and we go through hw_perf_enable() twice.

    By encapsulating the whole sequence in perf_disable()/perf_enable(),
    we ensure hw_perf_enable() is invoked only once, thanks to the
    refcount protection.
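
    A minimal sketch of the change's shape (context lookup elided,
    helper names simplified):

    void perf_event_task_sched_in(struct task_struct *task)
    {
            struct perf_event_context *ctx = task->perf_event_ctxp;

            perf_disable();                     /* refcount 0 -> 1: PMU off  */
            ctx_sched_in(ctx, EVENT_PINNED);    /* inner pairs only bump the */
            ctx_sched_in(ctx, EVENT_FLEXIBLE);  /* count, no hardware access */
            perf_enable();                      /* refcount 1 -> 0: PMU on,
                                                 * hw_perf_enable() runs once */
    }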

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    eranian@google.com
     
  • Export perf_trace_regs and perf_arch_fetch_caller_regs, since modules
    will use them.

    Signed-off-by: Xiao Guangrong
    [ use EXPORT_PER_CPU_SYMBOL_GPL() ]
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Xiao Guangrong
     
  • From: Ananth N Mavinakayanahalli

    When freeing the instruction slot, the arithmetic to calculate
    the index of the slot in the page needs to account for the total
    size of the instruction on the various architectures.

    Calculate the index correctly when freeing the out-of-line
    execution slot.
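
    A hedged sketch of the arithmetic in question (names illustrative,
    not the exact kernel code):

    static int slot_index(kprobe_opcode_t *slot, kprobe_opcode_t *page_base,
                          size_t insn_size /* in kprobe_opcode_t units */)
    {
            /* Pointer subtraction yields kprobe_opcode_t units, so the
             * divisor must be the per-arch instruction size in the same
             * units, not a byte count. */
            return (slot - page_base) / insn_size;
    }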

    Reported-by: Sachin Sant
    Reported-by: Heiko Carstens
    Signed-off-by: Ananth N Mavinakayanahalli
    Signed-off-by: Masami Hiramatsu
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Masami Hiramatsu
     
  • Anton Blanchard found that he could reliably make the kernel hit a
    BUG_ON in the slab allocator by taking a cpu offline and then online
    while a system-wide perf record session was running.

    The reason is that when the cpu comes up, we completely reinitialize
    the ctx field of the struct perf_cpu_context for the cpu. If there is
    a system-wide perf record session running, then there will be a struct
    perf_event that has a reference to the context, so its refcount will
    be 2. (The perf_event has been removed from the context's group_entry
    and event_entry lists by perf_event_exit_cpu(), but that doesn't
    remove the perf_event's reference to the context and doesn't decrement
    the context's refcount.)

    When the cpu comes up, perf_event_init_cpu() gets called, and it calls
    __perf_event_init_context() on the cpu's context. That resets the
    refcount to 1. Then when the perf record session finishes and the
    perf_event is closed, the refcount gets decremented to 0 and the
    context gets kfreed after an RCU grace period. Since the context
    wasn't kmalloced -- it's part of a per-cpu variable -- bad things
    happen.

    In fact we don't need to completely reinitialize the context when the
    cpu comes up. It's sufficient to initialize the context once at boot,
    but we need to do it for all possible cpus.

    This moves the context initialization to happen at boot time. With
    this, we don't trash the refcount and the context never gets kfreed,
    and we don't hit the BUG_ON.
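
    A minimal sketch of the approach, assuming helper names close to the
    description above:

    void __init perf_event_init(void)
    {
            int cpu;

            /* once, at boot, for every possible cpu: the hotplug path no
             * longer re-runs __perf_event_init_context() and so no longer
             * resets the refcount */
            for_each_possible_cpu(cpu)
                    perf_event_init_cpu_context(cpu);  /* hypothetical helper */
    }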

    Reported-by: Anton Blanchard
    Signed-off-by: Paul Mackerras
    Tested-by: Anton Blanchard
    Acked-by: Peter Zijlstra
    Cc:
    Signed-off-by: Ingo Molnar

    Paul Mackerras
     

10 Mar, 2010

9 commits

  • Drop the obsolete "profile" naming used by perf for trace events.
    Perf can now do more than simple events counting, so generalize
    the API naming.

    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Steven Rostedt
    Cc: Masami Hiramatsu
    Cc: Jason Baron

    Frederic Weisbecker
     
  • We are taking a wrong regs snapshot when a trace event triggers.
    Either we use get_irq_regs(), which gives us the interrupted
    registers if we are in an interrupt, or we use task_pt_regs(),
    which gives us the state before we entered the kernel, assuming
    we are lucky enough not to be a kernel thread, in which case
    task_pt_regs() returns the initial set of regs from when the
    kernel thread was started.

    What we want is different. We need a hot snapshot of the regs,
    so that we can get the instruction pointer to record in the
    sample, the frame pointer for the callchain, and some other
    things.

    Let's use the new perf_fetch_caller_regs() for that.
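
    A sketch of the call site this implies (handler and the skip
    argument's exact convention are assumptions):

    static void foo_event_handler(void *record, int size)
    {
            struct pt_regs regs;

            /* a hot snapshot taken right here, instead of guessing via
             * get_irq_regs() or task_pt_regs() */
            perf_fetch_caller_regs(&regs, 1);  /* 1: skip this frame (assumed) */

            /* ...hand &regs to the perf sample machinery... */
    }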

    Comparison with perf record -e lock: -R -a -f -g
    Before:

    perf [kernel] [k] __do_softirq
    |
    --- __do_softirq
    |
    |--55.16%-- __open
    |
    --44.84%-- __write_nocancel

    After:

    perf [kernel] [k] perf_tp_event
    |
    --- perf_tp_event
    |
    |--41.07%-- lock_acquire
    | |
    | |--39.36%-- _raw_spin_lock
    | | |
    | | |--7.81%-- hrtimer_interrupt
    | | | smp_apic_timer_interrupt
    | | | apic_timer_interrupt

    The old case was producing unreliable callchains. Now having
    right frame and instruction pointers, we have the trace we
    want.

    Also, syscall and kprobe events already have the right regs, so
    let's use those instead of wasting a retrieval.

    v2: Follow the rename perf_save_regs() -> perf_fetch_caller_regs()

    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: H. Peter Anvin
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Steven Rostedt
    Cc: Arnaldo Carvalho de Melo
    Cc: Masami Hiramatsu
    Cc: Jason Baron
    Cc: Archs

    Frederic Weisbecker
     
  • Events that trigger overflows by interrupting a context can
    use get_irq_regs() or task_pt_regs() to retrieve the state
    when the event triggered. But this is not the case for some
    other classes of events, like trace events, as tracepoints are
    executed in the same context as the code that triggered
    the event.

    It means we need a different API to capture the regs there;
    namely, we need a hot snapshot to get the most important
    information for perf: the instruction pointer to get the
    event origin, the frame pointer for the callchain, the code
    segment for user_mode() tests (we always use __KERNEL_CS as
    trace events always occur from the kernel) and the eflags
    for further purposes.
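
    A hedged sketch of what the snapshot must fill in on x86 (the real
    implementation captures the frame pointer at the call site; only the
    shape of the data is shown here):

    void perf_arch_fetch_caller_regs(struct pt_regs *regs,
                                     unsigned long ip, int skip)
    {
            memset(regs, 0, sizeof(*regs));
            regs->ip    = ip;                   /* event origin */
            regs->bp    = (unsigned long)
                          __builtin_frame_address(0); /* callchain start;
                                                 * 'skip' handling elided */
            regs->cs    = __KERNEL_CS;          /* user_mode() tests */
            regs->flags = 0;                    /* eflags, further purposes */
    }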

    v2: rename perf_save_regs to perf_fetch_caller_regs as per
    Masami's suggestion.

    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: H. Peter Anvin
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Steven Rostedt
    Cc: Arnaldo Carvalho de Melo
    Cc: Masami Hiramatsu
    Cc: Jason Baron
    Cc: Archs

    Frederic Weisbecker
     
  • There are rcu-locked read-side areas in the path where we submit
    a trace event, and the rcu_read_(un)lock() calls there trigger lock
    events, which create recursive events.

    One pair in do_perf_sw_event:

    __lock_acquire
    |
    |--96.11%-- lock_acquire
    | |
    | |--27.21%-- do_perf_sw_event
    | | perf_tp_event
    | | |
    | | |--49.62%-- ftrace_profile_lock_release
    | | | lock_release
    | | | |
    | | | |--33.85%-- _raw_spin_unlock

    Another pair in perf_output_begin/end:

    __lock_acquire
    |--23.40%-- perf_output_begin
    | | __perf_event_overflow
    | | perf_swevent_overflow
    | | perf_swevent_add
    | | perf_swevent_ctx_event
    | | do_perf_sw_event
    | | perf_tp_event
    | | |
    | | |--55.37%-- ftrace_profile_lock_acquire
    | | | lock_acquire
    | | | |
    | | | |--37.31%-- _raw_spin_lock

    The problem is not so much the trace recursion itself, as we already
    have recursion protection (though it's always wasteful to recurse).
    But the trace events are outside the lockdep recursion protection, so
    each lockdep event triggers a lock trace, which will in turn trigger
    two other lockdep events. Here the recursive lock trace event won't
    be taken because of the trace recursion, so the recursion stops there,
    but lockdep will still analyse these new events.

    To sum up, for each lockdep event we have:

    lock_*()
    |
    trace lock_acquire
    |
    ----- rcu_read_lock()
    | |
    | lock_acquire()
    | |
    | trace_lock_acquire() (stopped)
    | |
    | lockdep analyze
    |
    ----- rcu_read_unlock()
    |
    lock_release
    |
    trace_lock_release() (stopped)
    |
    lockdep analyze

    And you can repeat the above twice, as we have two rcu read-side
    sections when we submit an event.

    This is fixed in this patch by moving the lock trace event under
    the lockdep recursion protection.
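
    A sketch of the fix's shape (the real lock_acquire() takes more
    arguments; only the guard placement matters here):

    void lock_acquire(struct lockdep_map *lock, unsigned long ip)
    {
            unsigned long flags;

            raw_local_irq_save(flags);
            current->lockdep_recursion = 1;
            trace_lock_acquire(lock, ip);   /* tracepoint now fires inside
                                             * the guard, so its
                                             * rcu_read_lock() no longer
                                             * feeds back into lockdep */
            __lock_acquire(lock, ip);
            current->lockdep_recursion = 0;
            raw_local_irq_restore(flags);
    }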

    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Steven Rostedt
    Cc: Paul Mackerras
    Cc: Hitoshi Mitake
    Cc: Li Zefan
    Cc: Lai Jiangshan
    Cc: Masami Hiramatsu
    Cc: Jens Axboe

    Frederic Weisbecker
     
  • Try to avoid useless rotation and PMU disables.

    [ Could be improved by keeping a nr_runnable count to better account
    for the < PERF_STAT_INACTIVE counters ]

    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: paulus@samba.org
    Cc: eranian@google.com
    Cc: robert.richter@amd.com
    Cc: fweisbec@gmail.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently we always call hw_perf_disable(), even if it's already
    disabled, which seems superfluous, especially since it cannot be made
    NMI safe (see further patches).
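
    A sketch of the refcounted guard this leads to (close to the
    description; treat details as assumptions): the hardware is only
    touched on the 0->1 and 1->0 transitions.

    static DEFINE_PER_CPU(int, perf_disable_count);

    void perf_disable(void)
    {
            if (!__get_cpu_var(perf_disable_count)++)
                    hw_perf_disable();
    }

    void perf_enable(void)
    {
            if (!--__get_cpu_var(perf_disable_count))
                    hw_perf_enable();
    }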

    Signed-off-by: Peter Zijlstra
    Cc: paulus@samba.org
    Cc: eranian@google.com
    Cc: robert.richter@amd.com
    Cc: fweisbec@gmail.com
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Remove the hw_perf_event_*() hotplug hooks in favour of per PMU hotplug
    notifiers. This has the advantage of reducing the static weak interface
    as well as exposing all hotplug actions to the PMU.

    Use this to fix x86 hotplug usage where we did things in ONLINE which
    should have been done in UP_PREPARE or STARTING.
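
    A hedged sketch of such a per-PMU notifier (callback bodies elided,
    names illustrative):

    static int __cpuinit
    x86_pmu_notifier(struct notifier_block *self, unsigned long action,
                     void *hcpu)
    {
            unsigned int cpu = (long)hcpu;

            switch (action & ~CPU_TASKS_FROZEN) {
            case CPU_UP_PREPARE:
                    /* allocate per-cpu state: may sleep, may fail */
                    break;
            case CPU_STARTING:
                    /* program the hardware, running on the new cpu itself */
                    break;
            case CPU_DEAD:
                    /* free per-cpu state */
                    break;
            }
            return NOTIFY_OK;
    }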

    Signed-off-by: Peter Zijlstra
    Cc: Paul Mundt
    Cc: paulus@samba.org
    Cc: eranian@google.com
    Cc: robert.richter@amd.com
    Cc: fweisbec@gmail.com
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This makes it easier to extend perf_sample_data and fixes a bug on arm
    and sparc, which failed to set ->raw to NULL, which can cause crashes
    when combined with PERF_SAMPLE_RAW.

    It also optimizes PowerPC and tracepoints, because initializing the
    struct with a designated initializer forces the compiler to zero out
    the whole structure.
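
    A sketch of the helper this suggests (the changelog only requires
    that ->raw be cleared; the exact fields are an assumption):

    static inline void perf_sample_data_init(struct perf_sample_data *data,
                                             u64 addr)
    {
            data->addr = addr;
            data->raw  = NULL;   /* the field arm and sparc forgot to set */
    }

    /* usage (inside an event handler), replacing a designated
     * initializer that zeroed the whole struct:
     *
     *      struct perf_sample_data data;
     *      perf_sample_data_init(&data, 0);
     */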

    Signed-off-by: Peter Zijlstra
    Acked-by: Jean Pihet
    Reviewed-by: Frederic Weisbecker
    Acked-by: David S. Miller
    Cc: Jamie Iles
    Cc: Paul Mackerras
    Cc: Stephane Eranian
    Cc: stable@kernel.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Conflicts:
    tools/perf/util/probe-event.c

    Merge reason: Pick up -rc1 and resolve the conflict as well.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

08 Mar, 2010

5 commits

  • A little more whack-a-mole annotating the dynamic sysfs attributes. I
    had everything built into my earlier test kernel, and so I missed
    these.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • These are the non-static sysfs attributes that exist on
    my test machine. Fix them to use sysfs_attr_init or
    sysfs_bin_attr_init as appropriate. Simply making a sysfs
    attribute present is enough to see this, so the work is a
    little tedious but otherwise not too bad.
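
    A sketch of the pattern being applied (device and names
    hypothetical): every dynamically allocated attribute gets its
    lockdep class key set before registration.

    static ssize_t foo_show(struct device *dev,
                            struct device_attribute *attr, char *buf)
    {
            return sprintf(buf, "42\n");
    }

    static int add_foo_attr(struct device *dev)
    {
            struct device_attribute *attr;

            attr = kzalloc(sizeof(*attr), GFP_KERNEL);
            if (!attr)
                    return -ENOMEM;

            sysfs_attr_init(&attr->attr);   /* the annotation this series
                                             * adds for dynamic attributes */
            attr->attr.name = "foo";
            attr->attr.mode = 0444;
            attr->show = foo_show;

            return device_create_file(dev, attr);
    }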

    Signed-off-by: Eric W. Biederman
    Acked-by: WANG Cong
    Cc: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • Constify struct sysfs_ops.

    This is part of the ops structure constification
    effort started by Arjan van de Ven et al.

    Benefits of this constification:

    * prevents modification of data that is shared
    (referenced) by many other structure instances
    at runtime

    * detects/prevents accidental (but not intentional)
    modification attempts on archs that enforce
    read-only kernel data at runtime

    * potentially better optimized code as the compiler
    can assume that the const data cannot be changed

    * the compiler/linker move const data into .rodata
    and therefore exclude them from false sharing
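
    A sketch of the constified form (names hypothetical):

    static ssize_t foo_attr_show(struct kobject *kobj,
                                 struct attribute *attr, char *buf);
    static ssize_t foo_attr_store(struct kobject *kobj,
                                  struct attribute *attr,
                                  const char *buf, size_t count);

    static const struct sysfs_ops foo_sysfs_ops = {   /* now const */
            .show  = foo_attr_show,
            .store = foo_attr_store,
    };

    static struct kobj_type foo_ktype = {
            .sysfs_ops = &foo_sysfs_ops,   /* users take a const pointer */
    };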

    Signed-off-by: Emese Revfy
    Acked-by: David Teigland
    Acked-by: Matt Domsch
    Acked-by: Maciej Sosnowski
    Acked-by: Hans J. Koch
    Acked-by: Pekka Enberg
    Acked-by: Jens Axboe
    Acked-by: Stephen Hemminger
    Signed-off-by: Greg Kroah-Hartman

    Emese Revfy
     
  • Constify struct kset_uevent_ops.

    This is part of the ops structure constification
    effort started by Arjan van de Ven et al.

    Benefits of this constification:

    * prevents modification of data that is shared
    (referenced) by many other structure instances
    at runtime

    * detects/prevents accidental (but not intentional)
    modification attempts on archs that enforce
    read-only kernel data at runtime

    * potentially better optimized code as the compiler
    can assume that the const data cannot be changed

    * the compiler/linker move const data into .rodata
    and therefore exclude them from false sharing

    Signed-off-by: Emese Revfy
    Signed-off-by: Greg Kroah-Hartman

    Emese Revfy
     
  • Passing the attribute to the low level IO functions allows all kinds
    of cleanups, by sharing low level IO code without requiring
    a separate function for every piece of data.

    Drivers can also extend the attributes with their own data fields
    and use them in the low level functions.

    This is similar to sysdev_attributes and normal attributes.

    This is a tree-wide sweep, converting everything in one go.

    No functional changes in this patch other than passing the new
    argument everywhere.

    Tested on x86, the non x86 parts are uncompiled.
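
    A hedged sketch of the pattern this enables, taking a class attribute
    as one example (the wrapper struct and names are hypothetical):

    struct foo_attribute {
            struct class_attribute attr;
            int index;                      /* driver-private field */
    };

    static ssize_t foo_show(struct class *class,
                            struct class_attribute *attr, char *buf)
    {
            struct foo_attribute *fa =
                    container_of(attr, struct foo_attribute, attr);

            /* one shared show routine serves many attributes via fa->index */
            return sprintf(buf, "%d\n", fa->index);
    }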

    Signed-off-by: Andi Kleen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

07 Mar, 2010

16 commits

  • The current ELF dumper implementation can produce broken corefiles if
    program headers exceed 65535. This number is determined by the number of
    vmas the process has. In particular, some extreme programs may use
    more than 65535 vmas. (If you google max_map_count, you can find some
    users facing this problem.) Such programs can never generate
    correct coredumps.

    This patch implements "extended numbering", which uses the sh_info field
    of the first section header instead of the e_phnum field in order to
    represent up to 4294967295 vmas.

    This is supported by
    AMD64-ABI(http://www.x86-64.org/documentation.html) and
    Solaris(http://docs.sun.com/app/docs/doc/817-1984/).
    Of course, we are preparing patches for gdb and binutils.
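
    A sketch of the writer side (struct handling abbreviated): when the
    real count no longer fits in the 16-bit e_phnum, store the PN_XNUM
    escape value there and put the true count in section header 0's
    sh_info.

    static void fill_phnum(struct elfhdr *elf, struct elf_shdr *shdr0,
                           unsigned int segs)
    {
            if (segs >= PN_XNUM) {          /* PN_XNUM == 0xffff */
                    elf->e_phnum = PN_XNUM;
                    shdr0->sh_info = segs;  /* the real program header count */
            } else {
                    elf->e_phnum = segs;
            }
    }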

    Signed-off-by: Daisuke HATAYAMA
    Cc: "Luck, Tony"
    Cc: Jeff Dike
    Cc: David Howells
    Cc: Greg Ungerer
    Cc: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Cc: Andi Kleen
    Cc: Alan Cox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke HATAYAMA
     
  • elf_core_dump() and elf_fdpic_core_dump() use #ifdef and the
    corresponding macros to hide multi-line logic in functions. This patch
    removes the #ifdefs and replaces ELF_CORE_EXTRA_* with corresponding
    functions. For architectures not implementing ELF_CORE_EXTRA_*, we use
    weak functions in order to reduce the range of modification.

    This cleanup is for my next patches, but I think this cleanup itself is
    worth doing regardless of my final purpose.
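
    A sketch of the weak-default pattern (return type assumed):

    /* generic default: arches without extra core-dump phdrs inherit this */
    int __weak elf_core_extra_phdrs(void)
    {
            return 0;
    }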

    Signed-off-by: Daisuke HATAYAMA
    Cc: "Luck, Tony"
    Cc: Jeff Dike
    Cc: David Howells
    Cc: Greg Ungerer
    Cc: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Cc: Andi Kleen
    Cc: Alan Cox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke HATAYAMA
     
  • kernel/printk.c:72: warning: `saved_console_loglevel' defined but not used

    Signed-off-by: Gustavo F. Padovan
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gustavo F. Padovan
     
  • tasklist_lock does protect the task and its pid, so it can't go away.
    The problem is that find_pid_ns() itself is unsafe without the rcu
    lock: it can race with copy_process()->free_pid(any_pid).

    Protecting copy_process()->free_pid(any_pid) with tasklist_lock would make
    it possible to call find_task_by_pid_ns() under tasklist safely, but we
    don't do so because we are trying to get rid of the read_lock sites of
    tasklist_lock.
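
    The safe lookup shape, for reference (wrapper name hypothetical):

    static struct task_struct *task_by_pid_ns(pid_t nr,
                                              struct pid_namespace *ns)
    {
            struct task_struct *task;

            rcu_read_lock();
            task = find_task_by_pid_ns(nr, ns);
            if (task)
                    get_task_struct(task);  /* pin it before dropping rcu */
            rcu_read_unlock();

            return task;
    }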

    Signed-off-by: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • I've had some complaints about panic_timeout being wildly inaccurate on
    shared processor PowerPC partitions (a 3 minute panic_timeout taking 30
    minutes).

    The problem is that we loop on mdelay(1), and with a 1ms-in-10ms
    hypervisor timeslice each of these will take 10ms (i.e. 10x) longer. I
    expect other platforms with shared processor hypervisors will see the
    same issue.

    This patch keeps the old behaviour if we have a panic_blink (only keyboard
    LEDs right now) and does 1 second mdelays if we don't.
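
    A sketch of the resulting loop shape (the panic_blink interface is
    simplified here):

    static void panic_countdown(void)  /* stand-in for the loop in panic() */
    {
            int i;

            for (i = 0; i < panic_timeout; i++) {
                    if (panic_blink) {
                            int j;

                            for (j = 0; j < 1000; j++) {  /* keep LEDs alive */
                                    touch_nmi_watchdog();
                                    panic_blink(j);
                                    mdelay(1);
                            }
                    } else {
                            touch_nmi_watchdog();
                            mdelay(1000);  /* one long delay per second: a
                                            * stretched timeslice hurts once,
                                            * not a thousand times */
                    }
            }
    }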

    Signed-off-by: Anton Blanchard
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     
  • Make sure the compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
    add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.
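
    The two patterns, side by side (task and resource chosen arbitrarily):

    static unsigned long fetch_nofile_cur(struct task_struct *task)
    {
            /* helper form, for the current task: */
            if (task == current)
                    return rlimit(RLIMIT_NOFILE);

            /* ACCESS_ONCE form, guaranteeing a single read: */
            return ACCESS_ONCE(task->signal->rlim[RLIMIT_NOFILE].rlim_cur);
    }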

    Signed-off-by: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • Fetch rlimit (both hard and soft) values only once and work on them. It
    removes many accesses through sig structure and makes the code cleaner.

    Mostly a preparation for writable resource limits support.

    Signed-off-by: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • kernel/exit.c:1183:26: warning: symbol 'status' shadows an earlier one
    kernel/exit.c:1173:21: originally declared here

    Signed-off-by: Thiago Farina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thiago Farina
     
  • Fix the following 'make includecheck' warning:
    kernel/params.c: linux/string.h is included more than once.

    Signed-off-by: Jaswinder Singh Rajput
    Cc: André Goddard Rosa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jaswinder Singh Rajput
     
  • "ret" needs to be signed or the error handling for splice_to_pipe() won't
    work correctly.
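
    A sketch of why the type matters:

    static ssize_t foo_splice(struct pipe_inode_info *pipe,
                              struct splice_pipe_desc *spd)
    {
            ssize_t ret = splice_to_pipe(pipe, spd);  /* may be a negative
                                                       * errno */

            if (ret < 0)    /* with an unsigned ret this test can never be
                             * true: the errno wraps to a huge byte count */
                    pr_debug("splice failed: %zd\n", ret);

            return ret;
    }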

    Signed-off-by: Dan Carpenter
    Cc: Tom Zanussi
    Cc: Jens Axboe
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • additional_cpus is only supported on IA64 now; X86_64 should not be
    included.

    Signed-off-by: Chen Gong
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gong
     
  • There are quite a few GFP_KERNEL memory allocations made during
    suspend/hibernation and resume that may cause the system to hang, because
    the I/O operations they depend on cannot be completed due to the
    underlying devices being suspended.

    Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
    gfp_allowed_mask before suspend/hibernation and restoring the original
    values of these bits in gfp_allowed_mask during the subsequent resume.
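
    A sketch of the save/mask/restore pattern (helper names hypothetical):

    static gfp_t saved_gfp_mask;

    void pm_restrict_gfp_mask(void)         /* before suspend/hibernation */
    {
            saved_gfp_mask = gfp_allowed_mask;
            gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
    }

    void pm_restore_gfp_mask(void)          /* during the subsequent resume */
    {
            gfp_allowed_mask = saved_gfp_mask;
    }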

    [akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
    Signed-off-by: Rafael J. Wysocki
    Reported-by: Maxim Levitsky
    Cc: Sebastian Ott
    Cc: Benjamin Herrenschmidt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach into the tens
    of thousands. Real workloads are still a factor of 10 less process
    intensive than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.
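
    A sketch of the linkage object described above (field names as best
    reconstructed from the description):

    struct anon_vma_chain {
            struct vm_area_struct *vma;
            struct anon_vma *anon_vma;
            struct list_head same_vma;      /* chains hanging off one VMA */
            struct list_head same_anon_vma; /* chains hanging off one
                                             * anon_vma */
    };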

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Considering the nature of per-mm stats, the counter is an object
    shared among threads and can be a cache-miss point in the page
    fault path.

    This patch adds a per-thread cache for mm_counter. RSS values will be
    counted into a struct in task_struct and synchronized with the mm's
    counters at certain events.

    In this patch, the event is the number of calls to handle_mm_fault:
    the per-thread values are added to the mm every 64 calls.
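
    A sketch of the sync scheme (names illustrative; the 64-fault
    threshold is from the description above):

    #define TASK_RSS_EVENTS_THRESH  64

    static void check_sync_rss_stat(struct task_struct *task)
    {
            if (unlikely(task != current))
                    return;
            if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
                    sync_mm_rss(task, task->mm);   /* fold the thread-local
                                                    * deltas into the mm */
    }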

    A rough estimate with a small benchmark on parallel threads (2 threads)
    shows:
    [before]
    4.5 cache-miss/faults
    [after]
    4.0 cache-miss/faults
    In any case, the most contended object becomes mmap_sem as the number
    of threads grows.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, the per-mm statistics counters are defined by macros in
    sched.h.

    This patch modifies them to be
    - defined in mm.h as inline functions
    - backed by an array instead of macro-generated names.

    This patch is for reducing the size of a future patch that modifies
    the implementation of the per-mm counters.
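
    A sketch of the array form replacing macro-generated names (member
    list and field layout abbreviated, treat as assumptions):

    enum {
            MM_FILEPAGES,
            MM_ANONPAGES,
            NR_MM_COUNTERS
    };

    static inline unsigned long get_mm_counter(struct mm_struct *mm,
                                               int member)
    {
            return (unsigned long)
                    atomic_long_read(&mm->rss_stat.count[member]);
    }

    static inline void inc_mm_counter(struct mm_struct *mm, int member)
    {
            atomic_long_inc(&mm->rss_stat.count[member]);
    }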

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Rename for_each_bit to for_each_set_bit in the kernel source tree, to
    permit for_each_clear_bit() should that ever be added.

    The patch includes a macro to map the old for_each_bit() onto the new
    for_each_set_bit(). This is a (very) temporary thing to ease the migration.
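
    A sketch of the rename plus the temporary shim (mask walking shown
    for flavour):

    /* (very) temporary compatibility macro: */
    #define for_each_bit(bit, addr, size) \
            for_each_set_bit((bit), (addr), (size))

    static void dump_set_bits(const unsigned long *mask)
    {
            unsigned int bit;

            for_each_set_bit(bit, mask, BITS_PER_LONG)
                    printk(KERN_DEBUG "bit %u is set\n", bit);
    }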

    [akpm@linux-foundation.org: add temporary for_each_bit()]
    Suggested-by: Alexey Dobriyan
    Suggested-by: Andrew Morton
    Signed-off-by: Akinobu Mita
    Cc: "David S. Miller"
    Cc: Russell King
    Cc: David Woodhouse
    Cc: Artem Bityutskiy
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

06 Mar, 2010

2 commits

  • …nel/git/tip/linux-2.6-tip

    * 'perf-probes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86: Issue at least one memory barrier in stop_machine_text_poke()
    perf probe: Correct probe syntax on command line help
    perf probe: Add lazy line matching support
    perf probe: Show more lines after last line
    perf probe: Check function address range strictly in line finder
    perf probe: Use libdw callback routines
    perf probe: Use elfutils-libdw for analyzing debuginfo
    perf probe: Rename probe finder functions
    perf probe: Fix bugs in line range finder
    perf probe: Update perf probe document
    perf probe: Do not show --line option without dwarf support
    kprobes: Add documents of jump optimization
    kprobes/x86: Support kprobes jump optimization on x86
    x86: Add text_poke_smp for SMP cross modifying code
    kprobes/x86: Cleanup save/restore registers
    kprobes/x86: Boost probes when reentering
    kprobes: Jump optimization sysctl interface
    kprobes: Introduce kprobes jump optimization
    kprobes: Introduce generic insn_slot framework
    kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
    padata: Allocate the cpumask for the padata instance
    crypto: authenc - Move saved IV in front of the ablkcipher request
    crypto: hash - Fix handling of unaligned buffers
    crypto: authenc - Use correct ahash complete functions
    crypto: md5 - Set statesize

    Linus Torvalds
     

05 Mar, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
    init: Open /dev/console from rootfs
    mqueue: fix typo "failues" -> "failures"
    mqueue: only set error codes if they are really necessary
    mqueue: simplify do_open() error handling
    mqueue: apply mathematics distributivity on mq_bytes calculation
    mqueue: remove unneeded info->messages initialization
    mqueue: fix mq_open() file descriptor leak on user-space processes
    fix race in d_splice_alias()
    set S_DEAD on unlink() and non-directory rename() victims
    vfs: add NOFOLLOW flag to umount(2)
    get rid of ->mnt_parent in tomoyo/realpath
    hppfs can use existing proc_mnt, no need for do_kern_mount() in there
    Mirror MS_KERNMOUNT in ->mnt_flags
    get rid of useless vfsmount_lock use in put_mnt_ns()
    Take vfsmount_lock to fs/internal.h
    get rid of insanity with namespace roots in tomoyo
    take check for new events in namespace (guts of mounts_poll()) to namespace.c
    Don't mess with generic_permission() under ->d_lock in hpfs
    sanitize const/signedness for udf
    nilfs: sanitize const/signedness in dealing with ->d_name.name
    ...

    Fix up fairly trivial (famous last words...) conflicts in
    drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c

    Linus Torvalds
     

04 Mar, 2010

1 commit