17 Nov, 2011

1 commit

  • People keep asking how to get the preempt count, irq, and need resched info
    and we keep telling them to enable the latency format. Some developers think
    that traces without this info are completely useless, and for a lot of tasks
    they are.

    The first option was to enable the latency trace as the default format, but
    the header for the latency format is pretty useless for most tracers, and it
    also reports the timestamp in straight microseconds from the time the trace
    started. This is sometimes more difficult to read than the default trace,
    which shows seconds from the start of boot up.

    Latency format:

    # tracer: nop
    #
    # nop latency trace v1.1.5 on 3.2.0-rc1-test+
    # --------------------------------------------------------------------
    # latency: 0 us, #159771/64234230, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
    # -----------------
    # | task: <idle>-0 (uid:0 nice:0 policy:0 rt_prio:0)
    # -----------------
    #
    # _------=> CPU#
    # / _-----=> irqs-off
    # | / _----=> need-resched
    # || / _---=> hardirq/softirq
    # ||| / _--=> preempt-depth
    # |||| / delay
    # cmd pid ||||| time | caller
    # \ / ||||| \ | /
    migratio-6 0...2 41778231us+: rcu_note_context_switch

    The new format:

    # _-----=> irqs-off
    # / _----=> need-resched
    # | / _---=> hardirq/softirq
    # || / _--=> preempt-depth
    # ||| / delay
    # TASK-PID CPU# |||| TIMESTAMP FUNCTION
    # | | | |||| | |
    <idle>-0 [000] d..2 49.309305: cpuidle_get_driver
    <idle>-0 [000] d..2 49.309307: mwait_idle
    <idle>-0 [000] d..2 49.309309: need_resched
    <idle>-0 [000] d..2 49.309310: test_ti_thread_flag
    <idle>-0 [000] d..2 49.309312: trace_power_start.constprop.13
    <idle>-0 [000] d..2 49.309313: trace_cpu_idle
    <idle>-0 [000] d..2 49.309315: need_resched

    default format:

    <idle>-0 [000] 49.309305: cpuidle_get_driver
    <idle>-0 [000] 49.309307: mwait_idle
    <idle>-0 [000] 49.309309: need_resched
    <idle>-0 [000] 49.309310: test_ti_thread_flag
    <idle>-0 [000] 49.309312: trace_power_start.constprop.13
    <idle>-0 [000] 49.309313: trace_cpu_idle
    <idle>-0 [000] 49.309315: need_resched
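
    The four-character field in the new format ("d..2" above) packs the
    irqs-off, need-resched, hardirq/softirq and preempt-depth columns into one
    character each. A minimal userspace sketch of how such a field could be
    composed, using hypothetical helper names rather than the kernel code:

    #include <stdio.h>

    /* Hypothetical helper: build a "d..2"-style field from raw state. */
    static void format_irq_info(char out[5], int irqs_off, int need_resched,
                                int in_hardirq, int in_softirq, int preempt_depth)
    {
        out[0] = irqs_off ? 'd' : '.';
        out[1] = need_resched ? 'N' : '.';
        out[2] = in_hardirq ? (in_softirq ? 'H' : 'h')
                            : (in_softirq ? 's' : '.');
        out[3] = (preempt_depth > 0 && preempt_depth < 10)
                     ? '0' + preempt_depth : '.';
        out[4] = '\0';
    }

    int main(void)
    {
        char buf[5];

        format_irq_info(buf, 1, 0, 0, 0, 2);
        printf("%s\n", buf);                 /* prints "d..2" */
        return 0;
    }
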
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

15 Jul, 2011

1 commit

  • Currently the stack trace per event in ftrace is only 8 frames. This can
    be quite limiting and sometimes useless, especially when the "ignore frames"
    value is wrong and we also use up stack frames for the event processing
    itself.

    Change this to be dynamic by adding a percpu buffer that we can
    write a large stack frame into and then copy into the ring buffer.

    Interrupts and NMIs that come in while another event is being processed
    will only get to use the 8-frame stack. That should be enough, as the task
    that was interrupted will have the full stack frame anyway.
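
    A rough userspace analog of the idea, assuming glibc's backtrace() and
    hypothetical names (not the kernel implementation): capture into a large
    per-thread scratch buffer first, then copy only the entries that were
    actually used into the record.

    #include <execinfo.h>
    #include <string.h>
    #include <stdio.h>

    #define SCRATCH_DEPTH 128   /* large scratch area, analog of the percpu buffer */
    #define FALLBACK_DEPTH 8    /* what a fixed 8-entry record would hold */

    static __thread void *scratch[SCRATCH_DEPTH];

    /* Capture into the big scratch buffer, then copy only what was used. */
    static int capture_stack(void **record, int max_entries)
    {
        int n = backtrace(scratch, SCRATCH_DEPTH);

        if (n > max_entries)
            n = max_entries;
        memcpy(record, scratch, n * sizeof(void *));
        return n;
    }

    int main(void)
    {
        void *record[SCRATCH_DEPTH];
        int n = capture_stack(record, SCRATCH_DEPTH);

        printf("captured %d frames (a fixed record would hold %d)\n",
               n, FALLBACK_DEPTH);
        return 0;
    }
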

    Requested-by: Thomas Gleixner
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

26 May, 2011

1 commit


05 Apr, 2011

1 commit

  • Running the following commands:

    # enable the binary option
    echo 1 > ./options/bin
    # disable context info option
    echo 0 > ./options/context-info
    # tracing only events
    echo 1 > ./events/enable
    cat trace_pipe

    plus forcing the system to generate many tracing events, causes a lockup
    (in non-preemptive kernels) inside the tracing_read_pipe function.

    The issue is also easily reproduced by running ltp stress test.
    (ftrace_stress_test.sh)

    The reasons are:
    - the bin/hex/raw output functions for events are set to the
    trace_nop_print function, which prints nothing and
    returns the TRACE_TYPE_HANDLED value
    - the LOST EVENT trace does not handle trace_seq overflow

    These reasons cause the while loop in the tracing_read_pipe
    function to never break.

    The attached patch fixes handling of the lost event trace, and
    changes trace_nop_print to print minimal info, which is needed
    for correct tracing_read_pipe processing.
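
    A tiny sketch of the failure mode, with hypothetical names: if the
    per-event printer reports the event as handled but never advances the
    output buffer, a read loop that waits for the buffer to fill will spin
    forever on a busy stream.

    #include <stdio.h>

    enum print_ret { HANDLED, UNHANDLED };

    /* Analog of a no-op printer: consumes the event, emits nothing. */
    static enum print_ret nop_print(char *buf, int *len)
    {
        (void)buf;
        (void)len;               /* *len is never advanced */
        return HANDLED;
    }

    int main(void)
    {
        char buf[64];
        int len = 0, events = 1000;

        /* Analog of the read loop: keep printing until the buffer has data. */
        while (len == 0 && events > 0) {   /* without the "events" guard this never ends */
            nop_print(buf, &len);
            events--;
        }
        printf("consumed all events, produced %d bytes\n", len);
        return 0;
    }
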

    v2 changes:
    - omit the cond_resched changes in favor of the trace_nop_print changes
    - changed WARN to WARN_ONCE and added info to be able
    to find out the culprit

    v3 changes:
    - make more accurate patch comment

    Signed-off-by: Jiri Olsa
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Jiri Olsa
     

10 Mar, 2011

2 commits

  • Formatting change only to improve code readability. No code changes except to
    introduce intermediate variables.

    Signed-off-by: David Sharp
    LKML-Reference:

    [ Keep variable declarations and assignment separate ]

    Signed-off-by: Steven Rostedt

    David Sharp
     
  • The lock_depth field in the event headers was added as a temporary
    data point to help in removing the BKL. Now that the BKL has pretty
    much been removed, we can remove this field.

    This in turn changes the header from 12 bytes to 8 bytes,
    removing the 4 byte buffer that gcc would insert if the first field
    in the data load was 8 bytes in size.
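
    The padding being removed here can be reproduced with a small standalone
    check (hypothetical struct names, not the real event headers):

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical layouts, only to show the padding gcc inserts. */
    struct header12 { uint32_t a, b, c; };   /* 12 bytes: header with lock_depth */
    struct header8  { uint32_t a, b; };      /*  8 bytes: lock_depth removed     */

    struct event_old { struct header12 hdr; uint64_t payload; };
    struct event_new { struct header8  hdr; uint64_t payload; };

    int main(void)
    {
        /* On x86_64: 24 (12 + 4 padding + 8) versus 16 (8 + 8). */
        printf("old event: %zu bytes\n", sizeof(struct event_old));
        printf("new event: %zu bytes\n", sizeof(struct event_new));
        return 0;
    }
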

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

23 Jul, 2010

1 commit


21 Jul, 2010

1 commit

  • __print_flags() and __print_symbolic() use a percpu trace_seq:

    1) Its memory is allocated at compile time, so it wastes memory if we don't use tracing.
    2) It is percpu data, so it wastes more memory on multi-cpu systems.
    3) It disables preemption when it executes its core routine
    "trace_seq_printf(s, "%s: ", #call);" and introduces latency.

    So we move this trace_seq to struct trace_iterator.

    Signed-off-by: Lai Jiangshan
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Lai Jiangshan
     

20 Jul, 2010

1 commit

  • The special trace type was only used by sysprof. Let's remove it now
    that the sysprof ftrace plugin has been dropped.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Soeren Sandmann
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Li Zefan

    Frederic Weisbecker
     

28 May, 2010

1 commit

  • …git/tip/linux-2.6-tip

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (61 commits)
    tracing: Add __used annotation to event variable
    perf, trace: Fix !x86 build bug
    perf report: Support multiple events on the TUI
    perf annotate: Fix up usage of the build id cache
    x86/mmiotrace: Remove redundant instruction prefix checks
    perf annotate: Add TUI interface
    perf tui: Remove annotate from popup menu after failure
    perf report: Don't start the TUI if -D is used
    perf: Fix getline undeclared
    perf: Optimize perf_tp_event_match()
    perf: Remove more code from the fastpath
    perf: Optimize the !vmalloc backed buffer
    perf: Optimize perf_output_copy()
    perf: Fix wakeup storm for RO mmap()s
    perf-record: Share per-cpu buffers
    perf-record: Remove -M
    perf: Ensure that IOC_OUTPUT isn't used to create multi-writer buffers
    perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events
    perf, trace: Optimize tracepoints by removing IRQ-disable from perf/tracepoint interaction
    perf tui: Allow disabling the TUI on a per command basis in ~/.perfconfig
    ...

    Linus Torvalds
     

21 May, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (182 commits)
    [SCSI] aacraid: add an ifdef'd device delete case instead of taking the device offline
    [SCSI] aacraid: prohibit access to array container space
    [SCSI] aacraid: add support for handling ATA pass-through commands.
    [SCSI] aacraid: expose physical devices for models with newer firmware
    [SCSI] aacraid: respond automatically to volumes added by config tool
    [SCSI] fcoe: fix fcoe module ref counting
    [SCSI] libfcoe: FIP Keep-Alive messages for VPorts are sent with incorrect port_id and wwn
    [SCSI] libfcoe: Fix incorrect MAC address clearing
    [SCSI] fcoe: fix a circular locking issue with rtnl and sysfs mutex
    [SCSI] libfc: Move the port_id into lport
    [SCSI] fcoe: move link speed checking into its own routine
    [SCSI] libfc: Remove extra pointer check
    [SCSI] libfc: Remove unused fc_get_host_port_type
    [SCSI] fcoe: fixes wrong error exit in fcoe_create
    [SCSI] libfc: set seq_id for incoming sequence
    [SCSI] qla2xxx: Updates to ISP82xx support.
    [SCSI] qla2xxx: Optionally disable target reset.
    [SCSI] qla2xxx: ensure flash operation and host reset via sg_reset are mutually exclusive
    [SCSI] qla2xxx: Silence bogus warning by gcc for wrap and did.
    [SCSI] qla2xxx: T10 DIF support added.
    ...

    Linus Torvalds
     

15 May, 2010

1 commit

  • Multiple events may use the same method to print their data.
    Instead of having all events have a pointer to their print functions,
    the trace_event structure now points to a trace_event_functions structure
    that will hold the way to print out the event.

    The event itself is now passed to the print function to let the print
    function know what kind of event it should print.

    This opens the door to consolidating the way several events print
    their output.

    text data bss dec hex filename
    4913961 1088356 861512 6863829 68bbd5 vmlinux.orig
    4900382 1048964 861512 6810858 67ecea vmlinux.init
    4900446 1049028 861512 6810986 67ed6a vmlinux.preprint

    This change slightly increases the size but is needed for the next change.

    v3: Fix the branch tracer events to handle this change.

    v2: Fix the new function graph tracer event calls to handle this change.
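
    A minimal sketch of the resulting shape, with hypothetical names rather
    than the real trace_event definitions: events point at a shared table of
    print callbacks, and the event is passed to the callback so one
    implementation can serve several event types.

    #include <stdio.h>

    struct sample_event;   /* analog of struct trace_event */

    /* Analog of trace_event_functions: one shared table of output methods. */
    struct sample_event_functions {
        void (*print_trace)(struct sample_event *ev);
        void (*print_raw)(struct sample_event *ev);
    };

    struct sample_event {
        const char *name;
        int id;
        struct sample_event_functions *funcs;   /* events share one table */
    };

    /* One callback can print several event types because the event is passed in. */
    static void generic_print_trace(struct sample_event *ev)
    {
        printf("%s: id=%d\n", ev->name, ev->id);
    }

    static void generic_print_raw(struct sample_event *ev)
    {
        printf("raw %d\n", ev->id);
    }

    static struct sample_event_functions generic_funcs = {
        .print_trace = generic_print_trace,
        .print_raw   = generic_print_raw,
    };

    int main(void)
    {
        struct sample_event a = { "sched_switch_demo", 1, &generic_funcs };
        struct sample_event b = { "sched_wakeup_demo", 2, &generic_funcs };

        a.funcs->print_trace(&a);
        b.funcs->print_trace(&b);
        return 0;
    }
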

    Acked-by: Mathieu Desnoyers
    Acked-by: Masami Hiramatsu
    Acked-by: Frederic Weisbecker
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

06 May, 2010

1 commit


01 May, 2010

2 commits


10 Dec, 2009

2 commits

  • The trace_seq buffer might fill up, and right now one needs to check the
    return value of each printf into the buffer to detect that.

    Instead, have the buffer keep track of whether it is full or not, and
    reject more input if it is full or would have overflowed with an input
    that wasn't added.
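
    A minimal sketch of the idea, with hypothetical names (the kernel structure
    is trace_seq): the buffer latches a "full" flag on the first input that
    does not fit and silently rejects everything after it.

    #include <stdio.h>
    #include <stdarg.h>

    struct seq_buf {
        char data[32];
        unsigned int len;
        int full;                /* latched once an input does not fit */
    };

    static int seq_buf_printf(struct seq_buf *s, const char *fmt, ...)
    {
        va_list ap;
        int ret;

        if (s->full)
            return 0;            /* already overflowed: reject further input */

        va_start(ap, fmt);
        ret = vsnprintf(s->data + s->len, sizeof(s->data) - s->len, fmt, ap);
        va_end(ap);

        if (ret < 0 || (unsigned int)ret >= sizeof(s->data) - s->len) {
            s->data[s->len] = '\0';   /* drop the partial write */
            s->full = 1;
            return 0;
        }
        s->len += ret;
        return 1;
    }

    int main(void)
    {
        struct seq_buf s = { .len = 0, .full = 0 };

        seq_buf_printf(&s, "short entry ");
        seq_buf_printf(&s, "an entry that is far too long to fit in the buffer");
        seq_buf_printf(&s, "x");               /* rejected: buffer is full */
        printf("len=%u full=%d: \"%s\"\n", s.len, s.full, s.data);
        return 0;
    }
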

    Cc: Lai Jiangshan
    Signed-off-by: Johannes Berg
    Signed-off-by: Steven Rostedt

    Johannes Berg
     
  • If the seq_read fills the buffer it will call s_start again on the next
    iteration with the same position. This causes a problem with the
    function_graph tracer because it consumes the iteration in order to
    determine leaf functions.

    What happens is that the iterator stores the entry, and the function
    graph plugin will look at the next entry. If that next entry is a return
    of the same function and task, then the function is a leaf and the
    function_graph plugin calls ring_buffer_read which moves the ring buffer
    iterator forward (the trace iterator still points to the function start
    entry).

    The copying of the trace_seq to the seq_file buffer will fail if the
    seq_file buffer is full. The seq_read will not show this entry.
    The next read by userspace will cause seq_read to again call s_start
    which will reuse the trace iterator entry (the function start entry).
    But the function return entry was already consumed. The function graph
    plugin will think that this entry is a nested function and not a leaf.

    To solve this, the trace code now checks the return status of the
    seq_printf (trace_print_seq). If the writing to the seq_file buffer
    fails, we set a flag in the iterator (leftover) and we do not reset
    the trace_seq buffer. On the next call to s_start, we check the leftover
    flag, and if it is set, we just reuse the trace_seq buffer and do not
    call into the plugin print functions.
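
    A compact userspace sketch of the pattern, with hypothetical names: when
    the staged record does not fit in the destination, keep it as leftover and
    emit it on the next call instead of asking the producer for a new record.

    #include <stdio.h>
    #include <string.h>

    static const char *records[] = { "funcA entry", "funcA return", "funcB entry" };
    static unsigned int next_record;

    static char staged[64];
    static int leftover;         /* staged record not yet written out */

    /* Copy one record into the destination; keep it staged if it does not fit. */
    static int read_one(char *dst, size_t dst_len)
    {
        if (!leftover) {
            if (next_record >= sizeof(records) / sizeof(records[0]))
                return 0;                     /* nothing left to produce */
            snprintf(staged, sizeof(staged), "%s\n", records[next_record++]);
        }
        if (strlen(staged) >= dst_len) {
            leftover = 1;                     /* retry the same record next call */
            return -1;
        }
        strcpy(dst, staged);
        leftover = 0;
        return 1;
    }

    int main(void)
    {
        char small[8], big[64];

        if (read_one(small, sizeof(small)) < 0)   /* does not fit, stays staged */
            printf("destination full, record kept for the next call\n");
        while (read_one(big, sizeof(big)) > 0)    /* nothing was lost */
            printf("%s", big);
        return 0;
    }
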

    Before this patch:

    2) | fput() {
    2) | __fput() {
    2) 0.550 us | inotify_inode_queue_event();
    2) | __fsnotify_parent() {
    2) 0.540 us | inotify_dentry_parent_queue_event();

    After the patch:

    2) | fput() {
    2) | __fput() {
    2) 0.550 us | inotify_inode_queue_event();
    2) 0.548 us | __fsnotify_parent();
    2) 0.540 us | inotify_dentry_parent_queue_event();

    [
    Updated the patch to fix a missing return 0 from the trace_print_seq()
    stub when CONFIG_TRACING is disabled.

    Reported-by: Ingo Molnar
    ]

    Reported-by: Jiri Olsa
    Cc: Frederic Weisbecker
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

24 Oct, 2009

1 commit

  • trace_seq_printf() return value is a little ambiguous. It
    currently returns the length of the space available in the
    buffer. printf usually returns the amount written. This is not
    adequate here, because:

    trace_seq_printf(s, "");

    is perfectly legal, and returning 0 would indicate that it
    failed.

    We can always see the amount written by looking at the before
    and after values of s->len. This is not quite the same use as
    printf. We only care if the string was successfully written to
    the buffer or not.

    Make trace_seq_printf() return 0 if the trace oversizes the
    buffer's free space, 1 otherwise.
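
    A small sketch of the chosen semantics, with hypothetical names: return 1
    whenever the text fits (even an empty string) and 0 only on overflow,
    rather than returning a printf-style length.

    #include <stdio.h>
    #include <string.h>

    /* Returning "bytes written" instead would make a successful "" call look
     * like a failure, which is exactly the ambiguity described above. */
    static int seq_printf_ok(char *buf, size_t size, size_t *len, const char *text)
    {
        size_t n = strlen(text);

        if (*len + n >= size)
            return 0;                 /* does not fit: report failure */
        memcpy(buf + *len, text, n + 1);
        *len += n;
        return 1;                     /* fits, even when text is "" */
    }

    int main(void)
    {
        char buf[16];
        size_t len = 0;

        printf("%d\n", seq_printf_ok(buf, sizeof(buf), &len, ""));        /* 1 */
        printf("%d\n", seq_printf_ok(buf, sizeof(buf), &len, "hello "));  /* 1 */
        printf("%d\n", seq_printf_ok(buf, sizeof(buf), &len, "this is too long")); /* 0 */
        printf("\"%s\"\n", buf);
        return 0;
    }
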

    Signed-off-by: Jiri Olsa
    Signed-off-by: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     

08 Oct, 2009

1 commit


06 Oct, 2009

1 commit

  • The state char variable S should be reassigned if S == 0.

    We are missing the state of the task that is going to sleep for the
    context switch events (in the raw mode).

    Fortunately the problem arises with the sched_switch/wake_up
    tracers, not the sched trace events.

    The former are legacy now. But still, that was buggy.

    Signed-off-by: Hiroshi Shimamoto
    Cc: Steven Rostedt
    Acked-by: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hiroshi Shimamoto
     

12 Sep, 2009

2 commits

  • Both trace_output.c and trace_function_graph.c do basically the same
    thing to handle the printing of the latency-format. This patch moves
    the code into one function that both can use.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • This patch adds the lock depth of the big kernel lock to the generic
    entry header. This way we can see the depth of the lock and help
    in removing the BKL.

    Example:

    # _------=> CPU#
    # / _-----=> irqs-off
    # | / _----=> need-resched
    # || / _---=> hardirq/softirq
    # ||| / _--=> preempt-depth
    # |||| /_--=> lock-depth
    # |||||/ delay
    # cmd pid |||||| time | caller
    # \ / |||||| \ | /
    <idle>-0 2.N..3 5902255250us+: lock_acquire: read rcu_read_lock
    <idle>-0 2.N..3 5902255253us+: lock_release: rcu_read_lock
    <idle>-0 2dN..3 5902255257us+: lock_acquire: xtime_lock
    <idle>-0 2dN..4 5902255259us : lock_acquire: clocksource_lock
    <idle>-0 2dN..4 5902255261us+: lock_release: clocksource_lock

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

11 Sep, 2009

1 commit


02 Jul, 2009

1 commit

  • We will lose something if trace_seq->buffer[0] is 0, because the copy length
    is calculated by strlen() in seq_puts(), so use seq_write() instead of
    seq_puts().

    Here is an example, after a reboot:

    # echo kmemtrace > current_tracer
    # echo 0 > options/kmem_minimalistic
    # cat trace
    # tracer: kmemtrace
    #
    #

    Nothing is exported, because the first byte of trace_seq->buffer[ ]
    is KMEMTRACE_USER_ALLOC.

    (The value of KMEMTRACE_USER_ALLOC is zero; see
    kmemtrace_print_alloc_user() in kernel/trace/kmemtrace.c.)
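
    The effect can be reproduced in plain C (hypothetical buffer contents, not
    the real kmemtrace record): a strlen()-based copy stops at the leading zero
    byte, while an explicit-length write keeps everything.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* First byte is 0, like KMEMTRACE_USER_ALLOC at the start of the record. */
        char record[8] = { 0, 'd', 'a', 't', 'a', 1, 2, 3 };
        char out[8];

        size_t by_strlen = strlen(record);       /* 0: everything after is "lost" */
        memcpy(out, record, sizeof(record));     /* explicit length keeps it all  */

        printf("strlen-based copy length: %zu\n", by_strlen);
        printf("explicit-length copy byte[1]: %c\n", out[1]);   /* 'd' survived */
        return 0;
    }
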

    Signed-off-by: Xiao Guangrong
    Acked-by: Frederic Weisbecker
    Acked-by: Pekka Enberg
    Acked-by: Eduard - Gabriel Munteanu
    Cc: Steven Rostedt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Xiao Guangrong
     

10 Jun, 2009

2 commits


03 Jun, 2009

4 commits

  • The current method of printing out a stack trace is to add a new line
    and print out the trace:

    yum-updatesd-3120 [002] 573.691303:
    => do_softirq
    => irq_exit
    => smp_apic_timer_interrupt
    => apic_timer_interrupt

    This looks a bit awkward, and if we have both stack and user stack traces
    running, it would be nice to have a title to tell them apart, although
    it is easy to tell by the output.

    This patch adds an annotation to the start of the stack traces:

    init-1 [003] 929.304979: <stack trace>
    => user_path_at
    => vfs_fstatat
    => vfs_stat
    => sys_newstat
    => system_call_fastpath

    cat-3459 [002] 1016.824040: <user stack trace>
    =>
    =>
    =>

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Here is an updated patch to include the extra call to
    trace_seq_init() as requested. This is vs. the latest
    -tip tree and fixes the use of multiple __print_flags
    and __print_symbolic in a single tracer. Also tested
    to ensure it's working now:

    mount.gfs2-2534 [000] 235.850587: gfs2_glock_queue: 8.7 glock 1:2 dequeue PR
    mount.gfs2-2534 [000] 235.850591: gfs2_demote_rq: 8.7 glock 1:0 demote EX to NL flags:DI
    mount.gfs2-2534 [000] 235.850591: gfs2_glock_queue: 8.7 glock 1:0 dequeue EX
    glock_workqueue-2529 [000] 235.850666: gfs2_glock_state_change: 8.7 glock 1:0 state EX => NL tgt:NL dmt:NL flags:lDpI
    glock_workqueue-2529 [000] 235.850672: gfs2_glock_put: 8.7 glock 1:0 state NL => IV flags:I

    Signed-off-by: Steven Whitehouse
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Steven Whitehouse
     
  • According to "events/ftrace/user_stack/format", fix the output of
    the user stack.

    before fix:

    sh-1073 [000] 31.137561:

    after fix:

    sh-1072 [000] 37.039329:
    =>
    =>
    =>

    Signed-off-by: walimis
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    walimis
     
  • According to "events/ftrace/kernel_stack/format", the output format of
    the kernel stack should use "=>" instead of "<=".

    sh-1072 [000] 26.957752:
    => sys_clone
    => syscall_call
    sh-1075 [000] 39.792713: sched_switch: task sh:1075 [120] (R) ==> sh:1076 [120]
    sh-1075 [000] 39.792722:
    => schedule
    => preempt_schedule
    => wake_up_new_task
    => do_fork
    => sys_clone
    => syscall_call

    Signed-off-by: walimis
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    walimis
     

02 Jun, 2009

1 commit


27 May, 2009

2 commits

  • This patch adds __print_symbolic, which is similar to __print_flags but
    works for an enumeration type instead. That is, there is only a one-to-one
    mapping between the values and the symbols. When a match is made, it is
    printed; otherwise the hex value is output.
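
    A small sketch of the described behavior with a hypothetical table (not the
    kernel macro): each value maps to exactly one name, and anything without a
    match falls back to hex.

    #include <stdio.h>

    struct sym_name { unsigned long value; const char *name; };

    static void print_symbolic(unsigned long value,
                               const struct sym_name *syms, int count)
    {
        int i;

        for (i = 0; i < count; i++) {
            if (syms[i].value == value) {        /* one-to-one match */
                printf("%s", syms[i].name);
                return;
            }
        }
        printf("0x%lx", value);                  /* no match: fall back to hex */
    }

    int main(void)
    {
        static const struct sym_name state[] = {
            { 0, "RUNNING" }, { 1, "INTERRUPTIBLE" }, { 2, "UNINTERRUPTIBLE" },
        };

        print_symbolic(1, state, 3);  printf("\n");    /* INTERRUPTIBLE */
        print_symbolic(7, state, 3);  printf("\n");    /* 0x7 */
        return 0;
    }
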

    [ Impact: add interface for showing symbol names in events ]

    Signed-off-by: Steven Rostedt
    Signed-off-by: Frederic Weisbecker

    Steven Rostedt
     
  • Developers have been asking for the ability in the ftrace event tracer
    to display names of bits in a flags variable.

    Instead of printing out c2, it would be easier to read FOO|BAR|GOO,
    assuming that FOO is bit 1, BAR is bit 6 and GOO is bit 7.

    Some examples where this would be useful are the state flags in a context
    switch, kmalloc flags, and even permission flags when accessing files.
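
    A short sketch of the behavior, including the mask matching and custom
    delimiter mentioned in the notes below, using hypothetical flag names:

    #include <stdio.h>

    struct flag_name { unsigned long mask; const char *name; };

    /* Print names for all mask matches, joined by delim; leftover bits as hex. */
    static void print_flags(unsigned long flags, const char *delim,
                            const struct flag_name *names, int count)
    {
        const char *sep = "";
        int i;

        for (i = 0; i < count; i++) {
            if (names[i].mask && (flags & names[i].mask) == names[i].mask) {
                printf("%s%s", sep, names[i].name);
                flags &= ~names[i].mask;
                sep = delim;
            }
        }
        if (flags)
            printf("%s0x%lx", sep, flags);       /* unknown leftover bits */
    }

    int main(void)
    {
        static const struct flag_name demo[] = {
            { 0x02, "FOO" }, { 0x40, "BAR" }, { 0x80, "GOO" },
        };
        static const struct flag_name perm[] = {
            { 0x4, "r" }, { 0x2, "w" }, { 0x1, "x" },
        };

        print_flags(0xc2, "|", demo, 3);  printf("\n");   /* FOO|BAR|GOO */
        print_flags(0x7, "", perm, 3);    printf("\n");   /* rwx         */
        return 0;
    }
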

    [
    v2 changes include:

    Frederic Weisbecker's idea of using a mask instead of bits,
    so we can output GFP_KERNEL instead of GFP_WAIT|GFP_IO|GFP_FS.

    Li Zefan's idea of allowing the caller of __print_flags to add their
    own delimiter (or no delimiter), so that for file permissions we can
    get rwx instead of r|w|x.
    ]

    [
    v3 changes:

    Christoph Hellwig's idea of using an array instead of va_args.
    ]

    [ Impact: better displaying of flags in trace output ]

    Signed-off-by: Steven Rostedt
    Signed-off-by: Frederic Weisbecker

    Steven Rostedt
     

26 May, 2009

1 commit

  • I found that there is nothing to protect event_hash in
    ftrace_find_event(). RCU protects the event hashlist
    but not the event itself while we use it after its extraction
    through ftrace_find_event().

    This lack of proper locking opens a race
    window between any event dereferencing and module removal.

    Eg:

    --Task A--

    print_trace_line(trace) {
      event = find_ftrace_event(trace)

    --Task B--

    trace_module_remove_events(mod) {
      list_trace_events_module(ev, mod) {
        unregister_ftrace_event(ev->event) {
          hlist_del(ev->event->node)
          list_del(....)
        }
      }
    }
    |--> module removed, the event has been dropped

    --Task A--

    event->print(trace); // Dereferencing freed memory

    If the event retrieved belongs to a module and this module
    is concurrently removed, we may end up dereferencing data
    from a freed module.

    RCU could solve this, but it would add latency to the kernel and
    forbid tracer output callbacks from calling any sleepable code.
    So this fix converts 'trace_event_mutex' to a read/write semaphore,
    and adds trace_event_read_lock() to protect ftrace_find_event().

    [ Impact: fix possible freed memory dereference in ftrace ]
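
    A rough pthreads sketch of the locking pattern with hypothetical names:
    readers hold a read lock across both the lookup and the use of the event,
    so unregistering (under the write lock) cannot free it out from under them.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static pthread_rwlock_t event_lock = PTHREAD_RWLOCK_INITIALIZER;

    struct demo_event { void (*print)(void); };

    static struct demo_event *registered_event;

    static void demo_print(void) { printf("event printed\n"); }

    /* Reader: lookup and use are both covered by the read lock. */
    static void print_trace_line(void)
    {
        pthread_rwlock_rdlock(&event_lock);
        if (registered_event)
            registered_event->print();       /* safe: cannot be freed here */
        pthread_rwlock_unlock(&event_lock);
    }

    /* Writer: unregistering waits until no reader is using the event. */
    static void unregister_event(void)
    {
        pthread_rwlock_wrlock(&event_lock);
        free(registered_event);
        registered_event = NULL;
        pthread_rwlock_unlock(&event_lock);
    }

    int main(void)
    {
        registered_event = malloc(sizeof(*registered_event));
        registered_event->print = demo_print;

        print_trace_line();
        unregister_event();
        print_trace_line();      /* event gone, nothing is dereferenced */
        return 0;
    }
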

    Signed-off-by: Lai Jiangshan
    Acked-by: Steven Rostedt
    LKML-Reference:
    Signed-off-by: Frederic Weisbecker

    Lai Jiangshan
     

15 May, 2009

1 commit

  • The stack tracer stores eight entries in the ring buffer when an event
    traces the stack. The output prints all eight entries regardless of
    how many were actually recorded.

    This patch breaks out of the loop when a null entry is discovered.

    [ Impact: only print the stack that is recorded ]

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

06 May, 2009

1 commit

  • This compiler warning:

    CC kernel/trace/trace_output.o
    kernel/trace/trace_output.c: In function ‘register_ftrace_event’:
    kernel/trace/trace_output.c:544: warning: ‘list’ may be used uninitialized in this function

    is wrong, as 'list' is always initialized, but GCC (4.3.2) does not
    recognize this relationship properly.

    Work around the warning by initializing the variable to NULL.

    [ Impact: fix false positive compiler warning ]

    Signed-off-by: Jaswinder Singh Rajput
    Acked-by: Steven Rostedt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Jaswinder Singh Rajput
     

25 Apr, 2009

1 commit

  • With modules being able to add trace events, and the max trace event
    counter being 16 bits (65536), we can overflow the counter easily
    with a simple while loop that adds and removes modules containing
    trace events.

    This patch links together the registered trace events and on overflow
    searches for available trace event ids. It will still fail if
    over 65536 events are registered, but considering that a typical
    kernel only has 22000 functions, 65000 events should be sufficient.

    Reported-by: Li Zefan
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

24 Apr, 2009

1 commit

  • With the new event tracing registration, we must increase the number
    of events that can be registered. Currently the type field is only
    one byte, which leaves us only 256 possible events.

    Since we do not save the CPU number in the tracer anymore (it is determined
    by the per cpu ring buffer that is used) we have an extra byte to use.

    This patch increases the size of type from 1 byte (256 events) to
    2 bytes (65,536 events).

    It also adds a WARN_ON_ONCE if we exceed that limit.

    [ Impact: allow more than 255 events ]

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

15 Apr, 2009

1 commit


07 Apr, 2009

1 commit