07 Jan, 2010

2 commits

  • In the very unlikely case where the writer moves the head by one
    between where the head page is read and where the new reader page
    is assigned, _and_ the writer then writes and wraps the entire ring
    buffer so that the head page is back to what was originally read as
    the head page, the page to be swapped will have a corrupted next
    pointer.

    Simple solution is to wrap the assignment of the next pointer with a
    rb_list_head().

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • This reference at the end of rb_get_reader_page() was causing off-by-one
    writes to the prev pointer of the page after the reader page when that
    page is the head page, and therefore the reader page has the RB_PAGE_HEAD
    flag in its list.next pointer. This eventually results in a GPF in a
    subsequent call to rb_set_head_page() (usually from rb_get_reader_page())
    when that prev pointer is dereferenced. The dereferenced register would
    characteristically have an address that appears shifted left by one byte
    (e.g., ffxxxxxxxxxxxxyy instead of ffffxxxxxxxxxxxx) due to being written at
    an address one byte too high.

    Signed-off-by: David Sharp
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    David Sharp
     

17 Dec, 2009

1 commit

  • …nel/git/tip/linux-2.6-tip

    * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    tracing: Fix return of trace_dump_stack()
    ksym_tracer: Fix bad cast
    tracing/power: Remove two exports
    tracing: Change event->profile_count to be int type
    tracing: Simplify trace_option_write()
    tracing: Remove useless trace option
    tracing: Use seq file for trace_clock
    tracing: Use seq file for trace_options
    function-graph: Allow writing the same val to set_graph_function
    ftrace: Call trace_parser_clear() properly
    ftrace: Return EINVAL when writing invalid val to set_ftrace_filter
    tracing: Move a printk out of ftrace_raw_reg_event_foo()
    tracing: Pull up calls to trace_define_common_fields()
    tracing: Extract duplicate ftrace_raw_init_event_foo()
    ftrace.h: Use common pr_info fmt string
    tracing: Add stack trace to irqsoff tracer
    tracing: Add trace_dump_stack()
    ring-buffer: Move resize integrity check under reader lock
    ring-buffer: Use sync sched protection on ring buffer resizing
    tracing: Fix wrong usage of strstrip in trace_ksyms

    Linus Torvalds
     

15 Dec, 2009

3 commits


11 Dec, 2009

2 commits

  • While using an application that does splice on the ftrace ring
    buffer at start up, I triggered an integrity check failure.

    Looking into this, I discovered that resizing the buffer performs
    an integrity check after the buffer is resized. Unfortunately, this
    check is performed after the reader lock is released. If a reader is
    reading the buffer, it may cause the integrity check to trigger a
    false failure.

    This patch simply moves the integrity checker under the protection
    of the ring buffer reader lock.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • There was a comment in the ring buffer code that says the calling
    layers should prevent tracing or reading of the ring buffer while
    resizing. I have discovered that the tracers do not honor this
    arrangement.

    This patch moves the disabling and synchronizing the ring buffer to
    a higher layer during resizing. This guarantees that no writes
    are occurring while the resize takes place.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

06 Dec, 2009

2 commits

  • …git/tip/linux-2.6-tip

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (470 commits)
    x86: Fix comments of register/stack access functions
    perf tools: Replace %m with %a in sscanf
    hw-breakpoints: Keep track of user disabled breakpoints
    tracing/syscalls: Make syscall events print callbacks static
    tracing: Add DEFINE_EVENT(), DEFINE_SINGLE_EVENT() support to docbook
    perf: Don't free perf_mmap_data until work has been done
    perf_event: Fix compile error
    perf tools: Fix _GNU_SOURCE macro related strndup() build error
    trace_syscalls: Remove unused syscall_name_to_nr()
    trace_syscalls: Simplify syscall profile
    trace_syscalls: Remove duplicate init_enter_##sname()
    trace_syscalls: Add syscall_nr field to struct syscall_metadata
    trace_syscalls: Remove enter_id exit_id
    trace_syscalls: Set event_enter_##sname->data to its metadata
    trace_syscalls: Remove unused event_syscall_enter and event_syscall_exit
    perf_event: Initialize data.period in perf_swevent_hrtimer()
    perf probe: Simplify event naming
    perf probe: Add --list option for listing current probe events
    perf probe: Add argv_split() from lib/argv_split.c
    perf probe: Move probe event utility functions to probe-event.c
    ...

    Linus Torvalds
     
  • …el/git/tip/linux-2.6-tip

    * 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (40 commits)
    tracing: Separate raw syscall from syscall tracer
    ring-buffer-benchmark: Add parameters to set produce/consumer priorities
    tracing, function tracer: Clean up strstrip() usage
    ring-buffer benchmark: Run producer/consumer threads at nice +19
    tracing: Remove the stale include/trace/power.h
    tracing: Only print objcopy version warning once from recordmcount
    tracing: Prevent build warning: 'ftrace_graph_buf' defined but not used
    ring-buffer: Move access to commit_page up into function used
    tracing: do not disable interrupts for trace_clock_local
    ring-buffer: Add multiple iterations between benchmark timestamps
    kprobes: Sanitize struct kretprobe_instance allocations
    tracing: Fix to use __always_unused attribute
    compiler: Introduce __always_unused
    tracing: Exit with error if a weak function is used in recordmcount.pl
    tracing: Move conditional into update_funcs() in recordmcount.pl
    tracing: Add regex for weak functions in recordmcount.pl
    tracing: Move mcount section search to front of loop in recordmcount.pl
    tracing: Fix objcopy revision check in recordmcount.pl
    tracing: Check absolute path of input file in recordmcount.pl
    tracing: Correct the check for number of arguments in recordmcount.pl
    ...

    Linus Torvalds
     

17 Nov, 2009

1 commit

  • With the change in the way we process commits, a commit only happens
    at the outermost level, and we no longer need to worry about a commit
    ending after rb_start_commit() has been called. The code used to grab
    the commit page before the tail page to prevent a possible race, but
    this race no longer exists with the rb_start_commit()/rb_end_commit()
    interface.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

15 Nov, 2009

1 commit


04 Nov, 2009

2 commits


24 Oct, 2009

2 commits


06 Oct, 2009

1 commit

  • The sign info used for filters in the kernel is also useful to
    applications that process the trace stream. Add it to the format
    files and make it available to userspace.

    Signed-off-by: Tom Zanussi
    Acked-by: Frederic Weisbecker
    Cc: rostedt@goodmis.org
    Cc: lizf@cn.fujitsu.com
    Cc: hch@infradead.org
    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Tom Zanussi
     

20 Sep, 2009

1 commit


14 Sep, 2009

1 commit

  • The cmpxchg used by PowerPC does the following:

    ({                                                              \
        __typeof__(*(ptr)) _o_ = (o);                               \
        __typeof__(*(ptr)) _n_ = (n);                               \
        (__typeof__(*(ptr))) __cmpxchg((ptr), (unsigned long)_o_,   \
                                       (unsigned long)_n_,          \
                                       sizeof(*(ptr)));             \
    })

    This type-checks *ptr against both o and n.

    Unfortunately, the code in ring-buffer.c assigns longs to pointers
    and pointers to longs and causes a warning on PowerPC:

    ring_buffer.c: In function 'rb_head_page_set':
    ring_buffer.c:704: warning: initialization makes pointer from integer without a cast
    ring_buffer.c:704: warning: initialization makes pointer from integer without a cast
    ring_buffer.c: In function 'rb_head_page_replace':
    ring_buffer.c:797: warning: initialization makes integer from pointer without a cast

    This patch adds typecasts inside cmpxchg to annotate that a long is
    being cast to a pointer and a pointer is being cast to a long, which
    removes the PowerPC warnings.

    Reported-by: Stephen Rothwell
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

10 Sep, 2009

1 commit

  • rb_buffer_peek() operates only on a struct ring_buffer_per_cpu
    *cpu_buffer. Thus, instead of passing the variables buffer and cpu,
    it is better to use cpu_buffer directly. This also reduces the risk
    of races, since cpu_buffer is not calculated twice.

    Signed-off-by: Robert Richter
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Robert Richter
     

05 Sep, 2009

2 commits

  • Since the ability to swap the cpu buffers adds a small overhead to
    the recording of a trace, we only want to add it when needed.

    Only the irqsoff and preemptoff tracers use this feature, and both are
    not recommended for production kernels. This patch disables its use
    when neither irqsoff nor preemptoff is configured.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Because the irqsoff tracer can swap an internal CPU buffer, it is
    possible for a swap to happen between the start of a write and the
    point where the committing bit is set (the committing bit disables
    swapping).

    This patch adds a check for this and will fail the write if it detects it.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

04 Sep, 2009

7 commits

  • Currently the way RB_WARN_ON works, is to disable either the current
    CPU buffer or all CPU buffers, depending on whether a ring_buffer or
    ring_buffer_per_cpu struct was passed into the macro.

    Most users of RB_WARN_ON pass in the CPU buffer, so only that one
    CPU buffer gets disabled while the rest remain active. This can
    confuse users, even though a warning is sent to the console.

    This patch changes the macro to disable the entire buffer even if
    the CPU buffer is passed in.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • The latency tracers report the number of items in the trace buffer.
    This uses the ring buffer data to calculate this. Because discarded
    events are also counted, the numbers do not match the number of items
    that are printed. The ring buffer also adds a "padding" item to the
    end of each buffer page which also gets counted as a discarded item.

    This patch decrements the page's entry counter on a discard. This
    allows us to ignore discarded entries while reading the buffer.

    Decrementing the counter is still safe since it can only happen while
    the committing flag is still set.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • The function ring_buffer_event_discard can be used on any item in the
    ring buffer, even after the item was committed. This function provides
    no safety nets and is very race prone.

    An item may be safely removed from the ring buffer before it is committed
    with the ring_buffer_discard_commit.

    Since there are currently no users of this function, and because it
    is racy and error prone, this patch removes it altogether.

    Note, removing this function also allows the counters to ignore
    all discarded events (patches will follow).

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • When the ring buffer uses an iterator (static read mode, not
    on-the-fly reading) and it crosses a page boundary, it will skip the
    first entry on the next page. The reason is that the last entry of a
    page is usually padding if the page is not full. The padding will
    not be returned to the user.

    The problem arises in ring_buffer_read because it also increments
    the iterator. Because both the read and peek use the same
    rb_iter_peek, rb_iter_peek will return the padding but also
    increment to the next item. This is because ring_buffer_peek will
    not increment it itself.

    The ring_buffer_read will increment it again and then call rb_iter_peek
    again to get the next item. But that will be the second item, not the
    first one on the page.

    The reason this never showed up before, is because the ftrace utility
    always calls ring_buffer_peek first and only uses ring_buffer_read
    to increment to the next item. The ring_buffer_peek will always keep
    the pointer to a valid item and not padding. This just hid the bug.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • The loops in the ring buffer that use cpu_relax are not dependent on
    other CPUs. They simply came across some padding in the ring buffer
    and are skipping over it. It is a normal loop and does not require a
    cpu_relax.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • If a commit is taking place on a CPU ring buffer, do not allow it to
    be swapped. Return -EBUSY when this is detected instead.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • The callers of reset must ensure that no commit can be taking place
    at the time of the reset. If it does then we may corrupt the ring buffer.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

11 Aug, 2009

1 commit


08 Aug, 2009

1 commit


06 Aug, 2009

3 commits

  • When calling rb_buffer_peek() from ring_buffer_consume() and a
    padding event is returned, the function rb_advance_reader() is
    called twice. This may lead to missing samples or, under high
    workloads, to the warning below. This patch fixes this: if a padding
    event is returned by rb_buffer_peek(), it will now be consumed by
    the calling function.

    Also, I simplified some code in ring_buffer_consume().

    ------------[ cut here ]------------
    WARNING: at /dev/shm/.source/linux/kernel/trace/ring_buffer.c:2289 rb_advance_reader+0x2e/0xc5()
    Hardware name: Anaheim
    Modules linked in:
    Pid: 29, comm: events/2 Tainted: G W 2.6.31-rc3-oprofile-x86_64-standard-00059-g5050dc2 #1
    Call Trace:
    [] ? rb_advance_reader+0x2e/0xc5
    [] warn_slowpath_common+0x77/0x8f
    [] warn_slowpath_null+0xf/0x11
    [] rb_advance_reader+0x2e/0xc5
    [] ring_buffer_consume+0xa0/0xd2
    [] op_cpu_buffer_read_entry+0x21/0x9e
    [] ? __find_get_block+0x4b/0x165
    [] sync_buffer+0xa5/0x401
    [] ? __find_get_block+0x4b/0x165
    [] ? wq_sync_buffer+0x0/0x78
    [] wq_sync_buffer+0x5b/0x78
    [] worker_thread+0x113/0x1ac
    [] ? autoremove_wake_function+0x0/0x38
    [] ? worker_thread+0x0/0x1ac
    [] kthread+0x88/0x92
    [] child_rip+0xa/0x20
    [] ? kthread+0x0/0x92
    [] ? child_rip+0x0/0x20
    ---[ end trace f561c0a58fcc89bd ]---

    Cc: Steven Rostedt
    Cc:
    Signed-off-by: Robert Richter
    Signed-off-by: Ingo Molnar

    Robert Richter
     
  • The commit:

    commit e0fdace10e75dac67d906213b780ff1b1a4cc360
    Author: David Miller
    Date: Fri Aug 1 01:11:22 2008 -0700

    debug_locks: set oops_in_progress if we will log messages.

    Otherwise lock debugging messages on runqueue locks can deadlock the
    system due to the wakeups performed by printk().

    Signed-off-by: David S. Miller
    Signed-off-by: Ingo Molnar

    will permanently set oops_in_progress on any lockdep failure.
    When this triggers, it will cause any read from the ring buffer to
    permanently disable the ring buffer (not to mention no locking of
    printk).

    This patch removes the check. It keeps the print in NMI which makes
    sense. This is probably OK, since the ring buffer should not cause
    something to set oops_in_progress anyway.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • The function ring_buffer_discard_commit inverted the code path
    for the result of try_to_discard. It should skip incrementing the
    entry counter if try_to_discard succeeds. But instead, it increments
    the entry counter when the discard succeeds, and does not increment
    it when the discard fails.

    The result of this bug is that filtering will make the stat counters
    incorrect.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

17 Jul, 2009

1 commit

  • kernel/trace/ring_buffer.c: In function 'rb_tail_page_update':
    kernel/trace/ring_buffer.c:849: warning: value computed is not used
    kernel/trace/ring_buffer.c:850: warning: value computed is not used

    Add "(void)" casts to fix this warning; we don't need to handle the
    failure case of cmpxchg here, since it's fine if an interrupt
    already did the job.

    Changed from V1:
    Add a comment(which is written by Steven) for it.

    Signed-off-by: Lai Jiangshan
    Acked-by: Steven Rostedt
    Signed-off-by: Frederic Weisbecker

    Lai Jiangshan
     

08 Jul, 2009

2 commits

  • This patch converts the ring buffers into a completely lockless
    buffer recording system. The read side still takes locks since
    we still serialize readers. But the writers are the ones that
    must be lockless (those can happen in NMIs).

    The main change is to the "head_page" pointer. We write to the
    tail, and read from the head. The "head_page" pointer in the cpu
    buffer is now just a reference to where to look. The real head
    page is now kept in the head_page->list->prev->next pointer.
    That is, in the list head of the previous page we set flags.

    The list pages are allocated aligned such that the least significant
    bits of the list pointers are always zero. This gives us room to
    store flags in those pointers.

    bit 0: set when the page is a head page
    bit 1: set when the writer is moving the page (for overwrite mode)

    cmpxchg is used to update the pointer.

    When the writer wraps the buffer and the tail meets the head,
    in overwrite mode, the writer must move the head page forward.
    It first uses cmpxchg to change the pointer flag from 1 to 2.
    Once this is done, the reader on another CPU will not take the
    page from the buffer.

    The writers need to protect against interrupts (we don't bother with
    disabling interrupts because NMIs are allowed to write too).

    After the writer sets the pointer flag to 2, it takes care to
    manage interrupts coming in. This is described in detail within the
    comments of the code.

    Changes in version 2:
    - Let reader reset entries value of header page.
    - Fix tail page passing commit page on reader page test.
    - Always increment entries and write counter in rb_tail_page_update
    - Add safety check in rb_set_commit_to_write to break out of infinite loop
    - add mask in rb_is_reader_page

    [ Impact: lock free writing to the ring buffer ]

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • This patch changes the ring buffer data pages from using a link list
    head pointer, to making each buffer page point to another buffer page
    and never back to a "head".

    This makes the handling of the ring buffer less complex, since the
    traversing of the ring buffer pages no longer needs to account for the
    head pointer.

    This change also is needed to make the ring buffer lockless.

    [
    Changes in version 2:

    - Added change that Lai Jiangshan mentioned.

    From: Lai Jiangshan
    Date: Thu, 11 Jun 2009 11:25:48 +0800
    LKML-Reference:

    I'm not sure whether these 4 lines:
    bpage = list_entry(pages.next, struct buffer_page, list);
    list_del_init(&bpage->list);
    cpu_buffer->pages = &bpage->list;

    list_splice(&pages, cpu_buffer->pages);
    equal to these 2 lines:
    cpu_buffer->pages = pages.next;
    list_del(&pages);

    If they are equivalent, I think the second one
    is simpler. It may not be a strictly necessary cleanup.

    What I asked is: if they are equivalent, could you use these two lines:
    cpu_buffer->pages = pages.next;
    list_del(&pages);
    ]

    [ Impact: simplify the ring buffer to help make it lockless ]

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

25 Jun, 2009

1 commit

  • In hunting down the cause of the hwlat_detector ring buffer spew in
    my failed -next builds, it became obvious that folks are now treating
    ring_buffer as something generic and independent of tracing and
    thus suitable for public driver consumption.

    Given that there are only a few minor areas in ring_buffer that have any
    reliance on CONFIG_TRACING or CONFIG_FUNCTION_TRACER, provide stubs for
    those and make it generally available.

    Signed-off-by: Paul Mundt
    Cc: Jon Masters
    Cc: Steven Rostedt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Mundt
     

21 Jun, 2009

1 commit

  • …nel/git/tip/linux-2.6-tip

    * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
    tracing/urgent: warn in case of ftrace_start_up inbalance
    tracing/urgent: fix unbalanced ftrace_start_up
    function-graph: add stack frame test
    function-graph: disable when both x86_32 and optimize for size are configured
    ring-buffer: have benchmark test print to trace buffer
    ring-buffer: do not grab locks in nmi
    ring-buffer: add locks around rb_per_cpu_empty
    ring-buffer: check for less than two in size allocation
    ring-buffer: remove useless compile check for buffer_page size
    ring-buffer: remove useless warn on check
    ring-buffer: use BUF_PAGE_HDR_SIZE in calculating index
    tracing: update sample event documentation
    tracing/filters: fix race between filter setting and module unload
    tracing/filters: free filter_string in destroy_preds()
    ring-buffer: use commit counters for commit pointer accounting
    ring-buffer: remove unused variable
    ring-buffer: have benchmark test handle discarded events
    ring-buffer: prevent adding write in discarded area
    tracing/filters: strloc should be unsigned short
    tracing/filters: operand can be negative
    ...

    Fix up kmemcheck-induced conflict in kernel/trace/ring_buffer.c manually

    Linus Torvalds
     

18 Jun, 2009

1 commit

  • If ftrace_dump_on_oops is set, and an NMI detects a lockup, then it
    will need to read from the ring buffer. But the read side of the
    ring buffer still takes locks. This patch adds a check on the read
    side that if it is in an NMI, then it will disable the ring buffer
    and not take any locks.

    Reads can still happen on a disabled ring buffer.

    Signed-off-by: Steven Rostedt

    Steven Rostedt