18 May, 2010

3 commits

  • …el/git/tip/linux-2.6-tip

    * 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    tracing: Fix "integer as NULL pointer" warning.
    tracing: Fix tracepoint.h DECLARE_TRACE() to allow more than one header
    tracing: Make the documentation clear on trace_event boot option
    ring-buffer: Wrap open-coded WARN_ONCE
    tracing: Convert nop macros to static inlines
    tracing: Fix sleep time function profiling
    tracing: Show sample std dev in function profiling
    tracing: Add documentation for trace commands mod, traceon/traceoff
    ring-buffer: Make benchmark handle missed events
    ring-buffer: Make non-consuming read less expensive with lots of cpus.
    tracing: Add graph output support for irqsoff tracer
    tracing: Have graph flags passed in to output functions
    tracing: Add ftrace events for graph tracer
    tracing: Dump either the oops's cpu source or all cpus buffers
    tracing: Fix uninitialized variable of tracing/trace output

    Linus Torvalds
     
  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
    stop_machine: Move local variable closer to the usage site in cpu_stop_cpu_callback()
    sched, wait: Use wrapper functions
    sched: Remove a stale comment
    ondemand: Make the iowait-is-busy time a sysfs tunable
    ondemand: Solve a big performance issue by counting IOWAIT time as busy
    sched: Introduce get_cpu_iowait_time_us()
    sched: Eliminate the ts->idle_lastupdate field
    sched: Fold updating of the last_update_time_info into update_ts_time_stats()
    sched: Update the idle statistics in get_cpu_idle_time_us()
    sched: Introduce a function to update the idle statistics
    sched: Add a comment to get_cpu_idle_time_us()
    cpu_stop: add dummy implementation for UP
    sched: Remove rq argument to the tracepoints
    rcu: need barrier() in UP synchronize_sched_expedited()
    sched: correctly place paranoia memory barriers in synchronize_sched_expedited()
    sched: kill paranoia check in synchronize_sched_expedited()
    sched: replace migration_thread with cpu_stop
    stop_machine: reimplement using cpu_stop
    cpu_stop: implement stop_cpu[s]()
    sched: Fix select_idle_sibling() logic in select_task_rq_fair()
    ...

    Linus Torvalds
     
  • …git/tip/linux-2.6-tip

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (311 commits)
    perf tools: Add mode to build without newt support
    perf symbols: symbol inconsistency message should be done only at verbose=1
    perf tui: Add explicit -lslang option
    perf options: Type check all the remaining OPT_ variants
    perf options: Type check OPT_BOOLEAN and fix the offenders
    perf options: Check v type in OPT_U?INTEGER
    perf options: Introduce OPT_UINTEGER
    perf tui: Add workaround for slang < 2.1.4
    perf record: Fix bug mismatch with -c option definition
    perf options: Introduce OPT_U64
    perf tui: Add help window to show key associations
    perf tui: Make <- exit menus too
    perf newt: Add single key shortcuts for zoom into DSO and threads
    perf newt: Exit browser unconditionally when CTRL+C, q or Q is pressed
    perf newt: Fix the 'A'/'a' shortcut for annotate
    perf newt: Make <- exit the ui_browser
    x86, perf: P4 PMU - fix counters management logic
    perf newt: Make <- zoom out filters
    perf report: Report number of events, not samples
    perf hist: Clarify events_stats fields usage
    ...

    Fix up trivial conflicts in kernel/fork.c and tools/perf/builtin-record.c

    Linus Torvalds
     

07 May, 2010

2 commits


06 May, 2010

1 commit


05 May, 2010

1 commit


01 May, 2010

1 commit

  • The breakpoint generic layer assumes that archs always know in advance
    the static number of address registers available to host breakpoints
    through the HBP_NUM macro.

    However this is not true for every arch. For example, ARM needs to get
    this information dynamically to handle compatibility between
    different versions.

    To solve this, this patch proposes to drop the static HBP_NUM macro
    and let the arch provide the number of available slots through a
    new hw_breakpoint_slots() function. For archs that have
    CONFIG_HAVE_MIXED_BREAKPOINTS_REGS selected, it will be called once,
    since the same set of registers serves instruction and data
    breakpoints together.
    For the others it will be called once to get the number of
    instruction breakpoint registers and a second time to get the number
    of data breakpoint registers; the targeted type is given as a
    parameter of hw_breakpoint_slots().
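
    A minimal sketch of the shape such a hook could take (the
    CONFIG_HAVE_MIXED_BREAKPOINTS_REGS split and the TYPE_INST/TYPE_DATA
    constants follow the description above; the per-arch counting helpers
    are hypothetical):

        #ifdef CONFIG_HAVE_MIXED_BREAKPOINTS_REGS
        /* one register file shared by instruction and data breakpoints */
        static inline int hw_breakpoint_slots(int type)
        {
                return HBP_NUM;         /* arch-private constant */
        }
        #else
        int hw_breakpoint_slots(int type)
        {
                switch (type) {
                case TYPE_INST:
                        return arch_nr_inst_bp_regs();  /* hypothetical arch query */
                case TYPE_DATA:
                        return arch_nr_data_bp_regs();  /* hypothetical arch query */
                default:
                        return 0;
                }
        }
        #endif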

    Reported-by: Will Deacon
    Signed-off-by: Frederic Weisbecker
    Acked-by: Paul Mundt
    Cc: Mahesh Salgaonkar
    Cc: K. Prasad
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Jason Wessel
    Cc: Ingo Molnar

    Frederic Weisbecker
     

28 Apr, 2010

5 commits

  • When sleep_time is off the function profiler ignores the time that a task
    is scheduled out. When the task is scheduled out a timestamp is taken.
    When the task is scheduled back in, the timestamp is compared to the
    current time and the saved calltimes are adjusted accordingly.

    But when stopping the function profiler, the sched switch hook that
    does this adjustment was stopped before shutting down the tracer.
    This allowed some tasks to not get their timestamps set when they
    scheduled out. When the function profiler started again, this would
    skew the times of the scheduler functions.

    This patch moves the stopping of the sched switch to after the function
    profiler is stopped. It also ignores calltimes that are still zero
    (never set), which may happen on start up.
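
    A rough sketch of the two points above (function and field names are
    illustrative, not the actual diff): stop the profiler before dropping
    the sched switch hook, and skip the adjustment for a calltime that was
    never recorded.

        /* teardown order: stop collecting samples first, then drop the hook */
        static void profiler_shutdown(void)
        {
                stop_function_profiler();
                unregister_sched_switch_hook();
        }

        /* when scheduling back in: ignore tasks with no saved calltime */
        if (!saved_calltime)
                return;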

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • When combined with function graph tracing the ftrace function profiler
    also prints the average run time of functions. While this gives us some
    good information, it doesn't tell us anything about the variance of the
    run times of the function. This change prints out s^2, the square of
    the sample standard deviation, alongside the average.

    This change adds one entry to the profile record structure. This
    increases the memory footprint of the function profiler by 1/3 on a
    32-bit system, and by 1/5 on a 64-bit system when function graphing is
    enabled, though the memory is only allocated when the profiler is turned
    on. During the profiling, one extra line of code adds the squared
    calltime to the new record entry, so this should not adversely affect
    performance.

    Note that the square of the sample standard deviation is printed because
    there is no sqrt implementation for unsigned long long in the kernel.
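
    As a standalone illustration of the statistic being reported (not the
    kernel code; the profiler keeps equivalent running sums per record),
    the squared sample standard deviation can be derived from a sum and a
    sum of squares without ever taking a square root:

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                uint64_t times[] = { 120, 135, 110, 160, 125 }; /* calltimes, ns */
                uint64_t n = sizeof(times) / sizeof(times[0]);
                uint64_t sum = 0, sum_sq = 0;

                for (uint64_t i = 0; i < n; i++) {
                        sum += times[i];
                        sum_sq += times[i] * times[i]; /* the one extra field */
                }

                uint64_t avg = sum / n;
                uint64_t s2 = (sum_sq - n * avg * avg) / (n - 1); /* s^2 */

                printf("avg=%llu ns  s^2=%llu ns^2\n",
                       (unsigned long long)avg, (unsigned long long)s2);
                return 0;
        }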

    Signed-off-by: Chase Douglas
    LKML-Reference:

    [ fixed comment about ns^2 -> us^2 conversion ]
    Signed-off-by: Steven Rostedt

    Chase Douglas
     
    With the addition of the "missed events" flags that are stored in the
    commit field of the ring buffer page, the ring_buffer_benchmark
    was not updated to handle this. If events are missed, the
    missed-events flag is set in the ring buffer page; the benchmark
    will then count that flag as part of the size of the page and hit the
    BUG() when it tries to read beyond the page.

    The solution is simply to have the ring buffer benchmark mask off
    the extra bits.
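
    A minimal standalone illustration of the masking (the flag names are
    hypothetical; only the idea of flag bits riding in the commit/size
    field comes from the text above):

        #include <stdio.h>
        #include <stdint.h>

        #define MISSED_EVENTS_FLAG  (1u << 31)   /* hypothetical names */
        #define MISSED_STORED_FLAG  (1u << 30)
        #define COMMIT_SIZE_MASK    (~(MISSED_EVENTS_FLAG | MISSED_STORED_FLAG))

        int main(void)
        {
                uint32_t commit = MISSED_EVENTS_FLAG | 4096; /* flag + 4096 bytes */
                uint32_t size   = commit & COMMIT_SIZE_MASK; /* usable byte count */

                printf("raw commit=0x%08x usable size=%u bytes\n",
                       (unsigned)commit, (unsigned)size);
                return 0;
        }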

    Reported-by: Ingo Molnar
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • When performing a non-consuming read, a synchronize_sched() is
    performed once for every cpu which is actively tracing.

    This is very expensive, and can make it take several seconds to open
    up the 'trace' file with lots of cpus.

    Only one synchronize_sched() call is actually necessary. What is
    desired is for all cpus to see the disabling state change. So we
    transform the existing sequence:

        for_each_cpu() {
                ring_buffer_read_start();
        }

    where each ring_buffer_read_start() call performs a synchronize_sched(),
    into the following:

        for_each_cpu() {
                ring_buffer_read_prepare();
        }
        ring_buffer_read_prepare_sync();
        for_each_cpu() {
                ring_buffer_read_start();
        }

    wherein only the single ring_buffer_read_prepare_sync() call needs to
    do the synchronize_sched().

    The first phase, via ring_buffer_read_prepare(), allocates the 'iter'
    memory and increments ->record_disabled.

    In the second phase, ring_buffer_read_prepare_sync() makes sure this
    ->record_disabled state is visible fully to all cpus.

    And in the final third phase, the ring_buffer_read_start() calls reset
    the 'iter' objects allocated in the first phase since we now know that
    none of the cpus are adding trace entries any more.

    This makes opening the 'trace' file nearly instantaneous on a
    sparc64 Niagara2 box with 128 cpus tracing.

    Signed-off-by: David S. Miller
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    David Miller
     
  • Add function graph output to irqsoff tracer.

    The graph output is enabled by setting the new 'display-graph' trace option.

    Signed-off-by: Jiri Olsa
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Jiri Olsa
     

27 Apr, 2010

2 commits


22 Apr, 2010

1 commit

    The ftrace_dump_on_oops kernel parameter, sysctl and sysrq let one
    dump every cpu's buffer when an oops or panic happens.

    That is nice when you have few cpus, but it may take ages if you have
    many, and you can miss the real origin of the problem among all the
    cpu traces.

    Sometimes, all you need is to dump the buffer of the cpu that
    triggered the oops, which most of the time is our main interest.

    This patch modifies ftrace_dump_on_oops to handle this choice.

    The ftrace_dump_on_oops kernel parameter, when it comes alone, has
    the same behaviour as before. But ftrace_dump_on_oops=orig_cpu
    will only dump the buffer of the cpu that oops'ed.

    Similarly, sysctl kernel.ftrace_dump_on_oops=1 and
    echo 1 > /proc/sys/kernel/ftrace_dump_on_oops keep their previous
    behaviour. But setting 2 jumps into cpu origin dump mode.

    v2: Fix double setup
    v3: Fix spelling issues reported by Randy Dunlap
    v4: Also update __ftrace_dump in the selftests

    Signed-off-by: Frederic Weisbecker
    Acked-by: David S. Miller
    Acked-by: Steven Rostedt
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Li Zefan
    Cc: Lai Jiangshan

    Frederic Weisbecker
     

15 Apr, 2010

1 commit

  • Support basic types of integer (u8, u16, u32, u64, s8, s16, s32, s64) in
    kprobe tracer. With this patch, users can specify the above basic
    types for each argument after ':'. If omitted, the argument type is
    set as unsigned long (u32 or u64, arch-dependent).

    e.g.
    echo 'p account_system_time+0 hardirq_offset=%si:s32' > kprobe_events

    adds a probe that records hardirq_offset as a signed 32-bit value on
    the entry of account_system_time.

    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Arnaldo Carvalho de Melo

    Masami Hiramatsu
     

14 Apr, 2010

1 commit


08 Apr, 2010

2 commits


05 Apr, 2010

3 commits

    Because a local variable is not initialized, I got the following
    when I did 'cat tracing/trace' (not trace_pipe):

    CPU:0 [LOST 18446744071579453134 EVENTS]
    ps-3099 [000] 560.770221: lock_acquire: ffff880030865010 &(&dentry->d_lock)->rlock
    CPU:0 [LOST 18446744071579453134 EVENTS]
    ps-3099 [000] 560.770221: lock_release: ffff880030865010 &(&dentry->d_lock)->rlock
    CPU:0 [LOST 18446612133255294080 EVENTS]
    ps-3099 [000] 560.770221: lock_acquire: ffff880030865010 &(&dentry->d_lock)->rlock
    CPU:0 [LOST 18446744071579453134 EVENTS]
    ps-3099 [000] 560.770222: lock_release: ffff880030865010 &(&dentry->d_lock)->rlock
    CPU:0 [LOST 18446744071579453134 EVENTS]
    ps-3099 [000] 560.770222: lock_release: ffffffff816cfb98 dcache_lock

    See peek_next_entry(): it does not set *lost_events when we 'cat tracing/trace'.
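
    A sketch of the shape of such a fix (the surrounding call and exact
    signature are approximations, not the actual patch): the caller's
    local must start at zero because the iterator path never writes it.

        unsigned long lost_events = 0;  /* stays 0 if the peek path reports nothing */

        ent = peek_next_entry(iter, cpu, &ts, &lost_events);
        if (lost_events)
                trace_seq_printf(s, "CPU:%d [LOST %lu EVENTS]\n", cpu, lost_events);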

    Signed-off-by: Lai Jiangshan
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Lai Jiangshan
     
  • Tejun Heo
     
  • …/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf: Always build the powerpc perf_arch_fetch_caller_regs version
    perf: Always build the stub perf_arch_fetch_caller_regs version
    perf, probe-finder: Build fix on Debian
    perf/scripts: Tuple was set from long in both branches in python_process_event()
    perf: Fix 'perf sched record' deadlock
    perf, x86: Fix callgraphs of 32-bit processes on 64-bit kernels
    perf, x86: Fix AMD hotplug & constraint initialization
    x86: Move notify_cpu_starting() callback to a later stage
    x86,kgdb: Always initialize the hw breakpoint attribute
    perf: Use hot regs with software sched switch/migrate events
    perf: Correctly align perf event tracing buffer

    Linus Torvalds
     

03 Apr, 2010

1 commit


01 Apr, 2010

4 commits

    The trace event buffer used by perf to record raw sample events
    is typed as an array of char and may therefore not be 8-byte aligned
    by alloc_percpu().

    But we need it to be aligned to 8 on sparc64 because we cast
    this buffer into a random structure type built by the TRACE_EVENT()
    macro to store the traces. So if a random 64-bit field is accessed
    inside, it may not have the expected alignment.

    Use an array of long instead to force the appropriate alignment, and
    perform a compile time check to ensure the size in bytes of the buffer
    is a multiple of sizeof(long) so that its actual size doesn't get
    shrunk under us.

    This fixes unaligned accesses reported while using perf lock
    on sparc64.
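
    A sketch of the approach, assuming the scratch buffer size is governed
    by something like PERF_MAX_TRACE_SIZE (naming approximate): type the
    per-cpu buffer as an array of long and fail the build if its byte size
    stops being a multiple of sizeof(long).

        /* an array of unsigned long gets natural alignment from
         * alloc_percpu(), unlike an array of char */
        typedef unsigned long perf_trace_t[PERF_MAX_TRACE_SIZE / sizeof(unsigned long)];

        static int perf_trace_buf_check(void)
        {
                /* compile time guard: the division above must not
                 * silently shrink the usable size */
                BUILD_BUG_ON(PERF_MAX_TRACE_SIZE % sizeof(unsigned long));
                return 0;
        }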

    Suggested-by: David Miller
    Suggested-by: Tejun Heo
    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Steven Rostedt

    Frederic Weisbecker
     
  • Currently, binary readers of the ring buffer only know where events were
    lost, but not how many events were lost at that location.
    This information is available, but it would require adding another
    field to the sub buffer header to include it.

    But when an event cannot fit at the end of a sub buffer, it is written
    to the next sub buffer. This means there is a good chance that the
    buffer may have room to hold this counter. If it does, write
    the counter at the end of the sub buffer and set another flag
    in the data size field that states that this information exists.
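
    Roughly, the writer-side logic described above could look like the
    following sketch (the helper and flag names are made up; only the
    placement of the count and the extra flag follow the text):

        /* events were dropped before this page was (re)used */
        commit_field |= MISSED_EVENTS_FLAG;

        /* if the unused tail of the page can hold the counter, store it
         * there and advertise its presence with a second flag */
        if (tail_room(page) >= sizeof(unsigned int)) {
                *(unsigned int *)page_tail(page) = missed_events;
                commit_field |= MISSED_STORED_FLAG;
        }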

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
    Now that the ring buffer can keep track of where events are lost,
    use this information in the output of trace_pipe:

    hackbench-3588 [001] 1326.701660: lock_acquire: ffffffff816591e0 read rcu_read_lock
    hackbench-3588 [001] 1326.701661: lock_acquire: ffff88003f4091f0 &(&dentry->d_lock)->rlock
    hackbench-3588 [001] 1326.701664: lock_release: ffff88003f4091f0 &(&dentry->d_lock)->rlock
    CPU:1 [LOST 673 EVENTS]
    hackbench-3588 [001] 1326.702711: kmem_cache_free: call_site=ffffffff81102b85 ptr=ffff880026d96738
    hackbench-3588 [001] 1326.702712: lock_release: ffff88003e1480a8 &mm->mmap_sem
    hackbench-3588 [001] 1326.702713: lock_acquire: ffff88003e1480a8 &mm->mmap_sem

    Even works with the function graph tracer:

    2) ! 170.098 us | }
    2) 4.036 us | rcu_irq_exit();
    2) 3.657 us | idle_cpu();
    2) ! 190.301 us | }
    CPU:2 [LOST 2196 EVENTS]
    2) 0.853 us | } /* cancel_dirty_page */
    2) | remove_from_page_cache() {
    2) 1.578 us | _raw_spin_lock_irq();
    2) | __remove_from_page_cache() {

    Note, it does not work with the iterator "trace" file, since
    determining how many events were lost requires consuming the page
    from the ring buffer, which the iterator does not do.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Currently, when the ring buffer drops events, it does not record
    the fact that it did so. It does inform the writer that the event
    was dropped by returning a NULL event, but it does not put in any
    placeholder where the event was dropped.

    This is not a trivial thing to add because the ring buffer mostly
    runs in overwrite (flight recorder) mode. That is, when the ring
    buffer is full, new data will overwrite old data.

    In a producer/consumer mode, where new data is simply dropped when
    the ring buffer is full, it is trivial to add the placeholder
    for dropped events. When there's more room to write new data, then
    a special event can be added to notify the reader about the dropped
    events.

    But in overwrite mode, any new write can overwrite events. A
    placeholder cannot be inserted into the ring buffer since there may
    never be room. A reader could also come in at any time and miss the
    placeholder.

    Luckily, the way the ring buffer works, the read side can find out
    if events were lost or not, and how many events. Every time a write
    takes place, if it overwrites the header page (the next read) it
    updates an "overrun" variable that keeps track of the number of
    lost events. When a reader swaps out a page from the ring buffer,
    it can record this number, perform the swap, and then check to
    see if the number changed, and take the diff if it has, which would be
    the number of events dropped. This can be stored by the reader
    and returned to callers of the reader.

    Since the reader page swap will fail if the writer moved the head
    page since the time the reader page set up the swap, this gives room
    to record the overruns without worrying about races. If the reader
    sets up the pages, records the overrun, and then performs the swap,
    and the swap succeeds, then the overrun variable has not been
    updated since the setup before the swap.

    For binary readers of the ring buffer, a flag is set in the header
    of each sub page (sub buffer) of the ring buffer. This flag is embedded
    in the size field of the data on the sub buffer, in the 31st bit (the size
    can be 32 or 64 bits depending on the architecture), but only 27
    bits need to be used for the actual size (less, actually).

    We could add a new field in the sub buffer header to also record the
    number of events dropped since the last read, but this will change the
    format of the binary ring buffer a bit too much. Perhaps this change can
    be made if the information on the number of events dropped is considered
    important enough.

    Note, the notification of dropped events is only used by consuming reads
    or peeking at the ring buffer. Iterating over the ring buffer does not
    keep this information because the necessary data is only available when
    a page swap is made, and the iterator does not swap out pages.
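
    One way to realize the read-side bookkeeping described above, as a
    sketch with illustrative field and helper names (not the actual ring
    buffer code):

        unsigned long overrun = read_overrun(cpu_buffer);  /* writer's lost count */

        if (swap_in_reader_page(cpu_buffer)) {   /* fails if the writer raced us */
                /* swap succeeded, so 'overrun' is still current: report
                 * what was lost since the previous successful read */
                reader->lost_events      = overrun - cpu_buffer->last_overrun;
                cpu_buffer->last_overrun = overrun;
        }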

    Cc: Robert Richter
    Cc: Andi Kleen
    Cc: Li Zefan
    Cc: Arnaldo Carvalho de Melo
    Cc: "Luis Claudio R. Goncalves"
    Cc: Frederic Weisbecker
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

30 Mar, 2010

3 commits

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    include gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, and for others adding it to an
    implementation .h or embedding .c file was more appropriate. This
    step added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that they could be applied
    as a separate patch and serve as a bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     
  • In some error handling cases the lock is not unlocked. The return is
    converted to a goto, to share the unlock at the end of the function.

    A simplified version of the semantic patch that finds this problem is as
    follows: (http://coccinelle.lip6.fr/)

    //
    @r exists@
    expression E1;
    identifier f;
    @@

    f (...) { }
    //
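
    As a concrete (made-up) example of the conversion this patch performs,
    an early return under a held lock becomes a goto to a shared unlock:

        static int update_stats(struct stats *st, int val)   /* illustrative only */
        {
                int ret = 0;

                mutex_lock(&st->lock);
                if (val < 0) {
                        ret = -EINVAL;
                        goto out;     /* was: return -EINVAL;  (lock left held) */
                }
                st->total += val;
        out:
                mutex_unlock(&st->lock);
                return ret;
        }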

    Signed-off-by: Julia Lawall
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Julia Lawall
     
  • # echo 1 > events/enable
    # echo global > trace_clock

    ------------[ cut here ]------------
    WARNING: at kernel/lockdep.c:3162 check_flags+0xb2/0x190()
    ...
    ---[ end trace 3f86734a89416623 ]---
    possible reason: unannotated irqs-on.
    ...

    There's no reason to use the raw_local_irq_save() in trace_clock_global.
    The local_irq_save() version is fine, and does not cause the bug in lockdep.

    Acked-by: Peter Zijlstra
    Signed-off-by: Li Zefan
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Li Zefan
     

27 Mar, 2010

1 commit


26 Mar, 2010

1 commit

  • Support for the PMU's BTS features has been upstreamed in
    v2.6.32, but we still have the old and disabled ptrace-BTS,
    as Linus noticed it not so long ago.

    It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without
    regard for other uses (perf) and doesn't provide the flexibility
    needed for perf either.

    Its only users are ptrace-block-step and ptrace-bts; ptrace-bts
    was never used, and ptrace-block-step can be implemented using a
    much simpler approach.

    So axe all 3000 lines of it. That includes the *locked_memory*()
    APIs in mm/mlock.c as well.

    Reported-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    Cc: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Markus Metzger
    Cc: Steven Rostedt
    Cc: Andrew Morton
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

19 Mar, 2010

2 commits

  • The ring buffer uses 4 byte alignment while recording events into the
    buffer, even on 64bit machines. This saves space when there are lots
    of events being recorded at 4 byte boundaries.

    The ring buffer has a zero copy method to write into the buffer, with
    the reserving of space and then committing it. This may cause problems
    when writing an 8 byte word into a 4 byte alignment (not 8). For x86 and
    PPC this is not an issue, but on some architectures this would cause an
    out-of-alignment exception.

    This patch uses CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to determine
    if it is OK to use 4 byte alignments on 64 bit machines. If it is not,
    it forces the ring buffer event header to be 8 bytes and not 4,
    and will align the length of the data to be 8 byte aligned.
    This keeps the data payload at 8 byte alignments and will allow these
    machines to run without issue.

    The trick to this is that the header can be either 4 bytes or 8 bytes
    depending on the length of the data payload. The 4 byte header
    has a length field that supports up to 112 bytes. If the length of
    the data is more than 112, the length field is set to zero, and the actual
    length is stored in the next 4 bytes after the header.

    When CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set, the code forces
    a zero into the 4 byte header, forcing the length to be stored in the
    4 byte array that follows, even with a small data payload. It also
    forces the length of the data payload to be 8 byte aligned. The
    combination of these two guarantees that the data is always at an
    8 byte alignment.
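
    A sketch of the resulting decision (macro names approximate the
    description above rather than the exact ring buffer code):

        #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
        # define RB_FORCE_8BYTE_ALIGNMENT   0
        # define RB_ARCH_ALIGNMENT          4U   /* keep the compact 4 byte packing */
        #else
        # define RB_FORCE_8BYTE_ALIGNMENT   1
        # define RB_ARCH_ALIGNMENT          8U   /* pad so payloads land on 8 bytes */
        #endif

        /* length of the data portion, padded to the arch alignment */
        length = ALIGN(length, RB_ARCH_ALIGNMENT);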

    Tested-by: Frederic Weisbecker
    (on sparc64)
    Reported-by: Frederic Weisbecker
    Acked-by: David S. Miller
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • …/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits)
    perf: Fix unexported generic perf_arch_fetch_caller_regs
    perf record: Don't try to find buildids in a zero sized file
    perf: export perf_trace_regs and perf_arch_fetch_caller_regs
    perf, x86: Fix hw_perf_enable() event assignment
    perf, ppc: Fix compile error due to new cpu notifiers
    perf: Make the install relative to DESTDIR if specified
    kprobes: Calculate the index correctly when freeing the out-of-line execution slot
    perf tools: Fix sparse CPU numbering related bugs
    perf_event: Fix oops triggered by cpu offline/online
    perf: Drop the obsolete profile naming for trace events
    perf: Take a hot regs snapshot for trace events
    perf: Introduce new perf_fetch_caller_regs() for hot regs snapshot
    perf/x86-64: Use frame pointer to walk on irq and process stacks
    lockdep: Move lock events under lockdep recursion protection
    perf report: Print the map table just after samples for which no map was found
    perf report: Add multiple event support
    perf session: Change perf_session post processing functions to take histogram tree
    perf session: Add storage for separating event types in report
    perf session: Change add_hist_entry to take the tree root instead of session
    perf record: Add ID and to recorded event data when recording multiple events
    ...

    Linus Torvalds
     

17 Mar, 2010

1 commit

    perf_arch_fetch_caller_regs() is exported for the overridden x86
    version, but not for the generic weak version.

    As a general rule, weak functions should not have their symbol
    exported in the same file they are defined.

    So let's export it in trace_event_perf.c, as it is used by trace
    events only.

    This fixes:

    ERROR: ".perf_arch_fetch_caller_regs" [fs/xfs/xfs.ko] undefined!
    ERROR: ".perf_arch_fetch_caller_regs" [arch/powerpc/platforms/cell/spufs/spufs.ko] undefined!

    -v2: And also only build it if trace events are enabled.
    -v3: Fix changelog mistake

    Reported-by: Stephen Rothwell
    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Paul Mackerras
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

14 Mar, 2010

2 commits

  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    locking: Make sparse work with inline spinlocks and rwlocks
    x86/mce: Fix RCU lockdep splats
    rcu: Increase RCU CPU stall timeouts if PROVE_RCU
    ftrace: Replace read_barrier_depends() with rcu_dereference_raw()
    rcu: Suppress RCU lockdep warnings during early boot
    rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare()
    rcu: Suppress __mpol_dup() false positive from RCU lockdep
    rcu: Make rcu_read_lock_sched_held() handle !PREEMPT
    rcu: Add control variables to lockdep_rcu_dereference() diagnostics
    rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU
    rcu: Use wrapper function instead of exporting tasklist_lock
    sched, rcu: Fix rcu_dereference() for RCU-lockdep
    rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use
    rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU
    x86/gart: Unexport gart_iommu_aperture

    Fix trivial conflicts in kernel/trace/ftrace.c

    Linus Torvalds
     
  • …nel/git/tip/linux-2.6-tip

    * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    tracing: Do not record user stack trace from NMI context
    tracing: Disable buffer switching when starting or stopping trace
    tracing: Use same local variable when resetting the ring buffer
    function-graph: Init curr_ret_stack with ret_stack
    ring-buffer: Move disabled check into preempt disable section
    function-graph: Add tracing_thresh support to function_graph tracer
    tracing: Update the comm field in the right variable in update_max_tr
    function-graph: Use comment notation for func names of dangling '}'
    function-graph: Fix unused reference to ftrace_set_func()
    tracing: Fix warning in s_next of trace file ops
    tracing: Include irqflags headers from trace clock

    Linus Torvalds
     

13 Mar, 2010

2 commits

  • A bug was found with Li Zefan's ftrace_stress_test that caused applications
    to segfault during the test.

    Placing a tracing_off() in the segfault code, and examining several
    traces, I found that the following was always the case. The lock tracer
    was enabled (lockdep being required) and userstack was enabled. Testing
    this out, I just enabled the two, but that was not good enough. I needed
    to run something else that could trigger it. Running a load like hackbench
    did not work, but executing a new program would. The following would
    trigger the segfault within seconds:

    # echo 1 > /debug/tracing/options/userstacktrace
    # echo 1 > /debug/tracing/events/lock/enable
    # while :; do ls > /dev/null ; done

    Enabling the function graph tracer and looking at what was happening,
    I finally noticed that all crashes happened just after an NMI.

    1) | copy_user_handle_tail() {
    1) | bad_area_nosemaphore() {
    1) | __bad_area_nosemaphore() {
    1) | no_context() {
    1) | fixup_exception() {
    1) 0.319 us | search_exception_tables();
    1) 0.873 us | }
    [...]
    1) 0.314 us | __rcu_read_unlock();
    1) 0.325 us | native_apic_mem_write();
    1) 0.943 us | }
    1) 0.304 us | rcu_nmi_exit();
    [...]
    1) 0.479 us | find_vma();
    1) | bad_area() {
    1) | __bad_area() {

    After capturing several traces of failures, all of them happened
    after an NMI. Curious about this, I added a trace_printk() to the NMI
    handler to read the regs->ip and see where the NMI happened, and
    found out it was here:

    ffffffff8135b660 :
    ffffffff8135b660: 48 83 ec 78 sub $0x78,%rsp
    ffffffff8135b664: e8 97 01 00 00 callq ffffffff8135b800

    What was happening is that the NMI would happen at the place that a page
    fault occurred. It would call rcu_read_lock() which was traced by
    the lock events, and the user_stack_trace would run. This would trigger
    a page fault inside the NMI. I do not see where the CR2 register is
    saved or restored in NMI handling. This means that it would corrupt
    the page fault handling that the NMI interrupted.

    The reason the while loop of ls helped trigger the bug was that
    each execution of ls would cause lots of pages to be faulted in,
    increasing the chances of the race happening.

    The simple solution is to not allow user stack traces in NMI context.
    After this patch, I ran the above "ls" test for a couple of hours
    without any issues. Without this patch, the bug would trigger in less
    than a minute.
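
    The fix itself amounts to an early bail-out in the user stack capture
    path; a sketch (exact placement approximate):

        /* kernel/trace/trace.c, user stack capture (sketch) */
        if (unlikely(in_nmi()))
                return;   /* a user-space page fault here would clobber the
                           * fault state (e.g. CR2) of whatever the NMI
                           * interrupted */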

    Cc: stable@kernel.org
    Reported-by: Li Zefan
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
    When the trace iterator is read, tracing_start() and tracing_stop()
    are called to stop tracing while the iterator is processing the trace
    output.

    These functions disable both the standard buffer and the max latency
    buffer. But if the wakeup tracer is running, it can switch these
    buffers between the two disables:

        buffer = global_trace.buffer;
        if (buffer)
                ring_buffer_record_disable(buffer);

    <<
    Signed-off-by: Steven Rostedt

    Steven Rostedt