26 Oct, 2011

1 commit

  • * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (121 commits)
    perf symbols: Increase symbol KSYM_NAME_LEN size
    perf hists browser: Refuse 'a' hotkey on non symbolic views
    perf ui browser: Use libslang to read keys
    perf tools: Fix tracing info recording
    perf hists browser: Elide DSO column when it is set to just one DSO, ditto for threads
    perf hists: Don't consider filtered entries when calculating column widths
    perf hists: Don't decay total_period for filtered entries
    perf hists browser: Honour symbol_conf.show_{nr_samples,total_period}
    perf hists browser: Do not exit on tab key with single event
    perf annotate browser: Don't change selection line when returning from callq
    perf tools: handle endianness of feature bitmap
    perf tools: Add prelink suggestion to dso update message
    perf script: Fix unknown feature comment
    perf hists browser: Apply the dso and thread filters when merging new batches
    perf hists: Move the dso and thread filters from hist_browser
    perf ui browser: Honour the xterm colors
    perf top tui: Give color hints just on the percentage, like on --stdio
    perf ui browser: Make the colors configurable and change the defaults
    perf tui: Remove unneeded call to newtCls on startup
    perf hists: Don't format the percentage on hist_entry__snprintf
    ...

    Fix up conflicts in arch/x86/kernel/kprobes.c manually.

    Ingo's tree did the insane "add volatile to const array", which just
    doesn't make sense ("volatile const"?). But we could remove the const
    *and* make the array volatile to make doubly sure that gcc doesn't
    optimize it away..

    Also fix up kernel/trace/ring_buffer.c non-data-conflicts manually: the
    reader_lock has been turned into a raw lock by the core locking merge,
    and there was a new user of it introduced in this perf core merge. Make
    sure that new use also uses the raw accessor functions.

    Linus Torvalds
     

13 Sep, 2011

1 commit

  • The tracing locks can be taken in atomic context and therefore
    cannot be preempted on -rt - annotate it.

    In mainline this change documents the low level nature of
    the lock - otherwise there's no functional difference. Lockdep
    and Sparse checking will work as usual.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

31 Aug, 2011

1 commit

  • The stats file under per_cpu folder provides the number of entries,
    overruns and other statistics about the CPU ring buffer. However, the
    numbers do not provide any indication of how full the ring buffer is in
    bytes compared to the overall size in bytes. Also, it is helpful to know
    the rate at which the cpu buffer is filling up.

    This patch adds an entry "bytes: " in printed stats for per_cpu ring
    buffer which provides the actual bytes consumed in the ring buffer. This
    field includes the number of bytes used by recorded events and the
    padding bytes added when moving the tail pointer to next page.

    It also adds the following time stamps:
    "oldest event ts:" - the oldest timestamp in the ring buffer
    "now ts:" - the timestamp at the time of reading

    The field "now ts" provides a consistent time snapshot to the userspace
    when being read. This is read from the same trace clock used by tracing
    event timestamps.

    Together, these values provide the rate at which the buffer is filling
    up, from the formula:
    bytes / (now_ts - oldest_event_ts)

    Signed-off-by: Vaibhav Nagarnaik
    Cc: Michael Rubin
    Cc: David Sharp
    Link: http://lkml.kernel.org/r/1313531179-9323-3-git-send-email-vnagarnaik@google.com
    Signed-off-by: Steven Rostedt

    Vaibhav Nagarnaik
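
    The fill-rate formula in this commit message can be sketched in Python.
    This is a minimal illustration, not part of the patch: the helper names
    parse_stats and fill_rate, the sample text, and the assumption that both
    timestamps are printed in seconds are all hypothetical.

```python
def parse_stats(text):
    """Parse 'name: value' lines into a dict of floats (illustrative format)."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            stats[name.strip()] = float(value)
    return stats


def fill_rate(stats):
    """Return bytes / (now_ts - oldest_event_ts), or None if no time elapsed."""
    elapsed = stats["now ts"] - stats["oldest event ts"]
    if elapsed <= 0:
        return None
    return stats["bytes"] / elapsed


# Hypothetical sample; real values would be read from the per_cpu stats file.
SAMPLE = """\
entries: 100
overrun: 0
bytes: 4000
oldest event ts: 10.0
now ts: 12.0
"""

rate = fill_rate(parse_stats(SAMPLE))
print(rate)
```

    Under the seconds assumption, the result is in bytes per second. Sampling
    the file periodically and comparing rates would show how quickly a given
    CPU buffer approaches its size limit.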
     

15 Jun, 2011

3 commits

  • The tracing ring buffer is allocated from kernel memory. While
    allocating a large chunk of memory, OOM might happen which destabilizes
    the system. Thus random processes might get killed during the
    allocation.

    This patch adds __GFP_NORETRY flag to the ring buffer allocation calls
    to make it fail more gracefully if the system will not be able to
    complete the allocation request.

    Acked-by: David Rientjes
    Signed-off-by: Vaibhav Nagarnaik
    Cc: Ingo Molnar
    Cc: Frederic Weisbecker
    Cc: Michael Rubin
    Cc: David Sharp
    Link: http://lkml.kernel.org/r/1307491302-9236-1-git-send-email-vnagarnaik@google.com
    Signed-off-by: Steven Rostedt

    Vaibhav Nagarnaik
     
  • This patch replaces the code for getting an unsigned long from a
    userspace buffer with a simple call to kstrtoul_from_user.
    This makes it easier to read and less error prone.

    Signed-off-by: Peter Huewe
    Link: http://lkml.kernel.org/r/1307476707-14762-1-git-send-email-peterhuewe@gmx.de
    Signed-off-by: Steven Rostedt

    Peter Huewe
     
  • The tracing ring buffer is a group of per-cpu ring buffers where
    allocation and logging is done on a per-cpu basis. The events that are
    generated on a particular CPU are logged in the corresponding buffer.
    This is to provide wait-free writes between CPUs and good NUMA node
    locality while accessing the ring buffer.

    However, the allocation routines consider NUMA locality only for buffer
    page metadata and not for the actual buffer page. This causes the pages
    to be allocated on the NUMA node local to the CPU where the allocation
    routine is running at the time.

    This patch fixes the problem by using a NUMA node specific allocation
    routine so that the pages are allocated from a NUMA node local to the
    logging CPU.

    I tested with the getuid_microbench from autotest. It is a simple binary
    that calls getuid() in a loop and measures the average time for the
    syscall to complete. The following command was used to test:
    $ getuid_microbench 1000000

    Compared the numbers found on kernel with and without this patch and
    found that logging latency decreases by 30-50 ns/call.
    tracing with non-NUMA allocation - 569 ns/call
    tracing with NUMA allocation - 512 ns/call

    Signed-off-by: Vaibhav Nagarnaik
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Michael Rubin
    Cc: David Sharp
    Link: http://lkml.kernel.org/r/1304470602-20366-1-git-send-email-vnagarnaik@google.com
    Signed-off-by: Steven Rostedt

    Vaibhav Nagarnaik
     

26 May, 2011

1 commit

  • Witold reported a reboot caused by the selftests of the dynamic function
    tracer. He sent me a config and I used ktest to do a config_bisect on it
    (as my config did not cause the crash). It pointed out that the problem
    config was CONFIG_PROVE_RCU.

    What happened was that if multiple callbacks are attached to the
    function tracer, we iterate a list of callbacks. Because the list is
    managed by synchronize_sched() and preempt_disable, the access to the
    pointers uses rcu_dereference_raw().

    When PROVE_RCU is enabled, the rcu_dereference_raw() calls some
    debugging functions, which happen to be traced. The tracing of the debug
    function would then call rcu_dereference_raw() which would then call the
    debug function and then... well you get the idea.

    I first wrote two different patches to solve this bug.

    1) add a __rcu_dereference_raw() that would not do any checks.
    2) add notrace to the offending debug functions.

    Both of these patches worked.

    Talking with Paul McKenney on IRC, he suggested adding recursion
    detection instead. This seemed to be a better solution, so I decided to
    implement it. The task_struct already has a trace_recursion field to
    detect recursion in the ring buffer, and it only allows a very small
    recursion depth, so I decided to use that same variable to add flags
    that can detect recursion inside the infrastructure of the function
    tracer.

    I plan to change it so that the task struct bit can be checked in
    mcount, but as that requires changes to all archs, I will hold that off
    to the next merge window.

    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Frederic Weisbecker
    Cc: Paul E. McKenney
    Link: http://lkml.kernel.org/r/1306348063.1465.116.camel@gandalf.stny.rr.com
    Reported-by: Witold Baryluk
    Signed-off-by: Steven Rostedt

    Steven Rostedt
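
    The recursion detection described in this commit can be modelled in a few
    lines of Python. This is only a sketch of the idea: the Task class, the
    flag value TRACE_INTERNAL_BIT, and the debug_check helper are hypothetical
    stand-ins, not the kernel's actual names or bit layout.

```python
# Hypothetical flag bit; the real bit layout of trace_recursion is not shown here.
TRACE_INTERNAL_BIT = 1 << 11


class Task:
    """Stand-in for task_struct with only the field this sketch needs."""

    def __init__(self):
        self.trace_recursion = 0
        self.traced = []  # names recorded by the tracer


def debug_check(task):
    # Models a debug helper that is itself traced, so calling it
    # re-enters the tracer.
    function_tracer(task, "debug_check")


def function_tracer(task, name):
    if task.trace_recursion & TRACE_INTERNAL_BIT:
        return  # already inside the tracer for this task: do nothing
    task.trace_recursion |= TRACE_INTERNAL_BIT
    try:
        task.traced.append(name)
        debug_check(task)  # would recurse without bound if unguarded
    finally:
        task.trace_recursion &= ~TRACE_INTERNAL_BIT
```

    The flag is per task, so a nested entry on the same task returns
    immediately while other tasks are unaffected. The flag is cleared on the
    way out so the next traced call is recorded normally.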
     

31 Mar, 2011

1 commit


19 Mar, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
    doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
    Update cpuset info & webiste for cgroups
    dcdbas: force SMI to happen when expected
    arch/arm/Kconfig: remove one to many l's in the word.
    asm-generic/user.h: Fix spelling in comment
    drm: fix printk typo 'sracth'
    Remove one to many n's in a word
    Documentation/filesystems/romfs.txt: fixing link to genromfs
    drivers:scsi Change printk typo initate -> initiate
    serial, pch uart: Remove duplicate inclusion of linux/pci.h header
    fs/eventpoll.c: fix spelling
    mm: Fix out-of-date comments which refers non-existent functions
    drm: Fix printk typo 'failled'
    coh901318.c: Change initate to initiate.
    mbox-db5500.c Change initate to initiate.
    edac: correct i82975x error-info reported
    edac: correct i82975x mci initialisation
    edac: correct commented info
    fs: update comments to point correct document
    target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
    ...

    Trivial conflict in fs/eventpoll.c (spelling vs addition)

    Linus Torvalds
     

10 Mar, 2011

3 commits

  • The "Delta way too big" warning might appear on a system with an
    unstable sched clock right after the system is resumed, if tracing
    was enabled at the time of suspend.

    Since it's not really a bug, and the unstable sched clock is otherwise
    fast and reliable, Steven suggested to keep using the sched clock in
    any case and just to make a note of it in the warning itself.

    v2 changes:
    - added #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK

    Signed-off-by: Jiri Olsa
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Jiri Olsa
     
  • Signed-off-by: David Sharp
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    David Sharp
     
  • Add an "overwrite" trace_option for ftrace to control whether the buffer should
    be overwritten on overflow or not. The default remains to overwrite old events
    when the buffer is full. This patch adds the option to instead discard newest
    events when the buffer is full. This is useful to get a snapshot of traces just
    after enabling traces. Dropping the current event is also a simpler code path.

    Signed-off-by: David Sharp
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    David Sharp
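
    The two overflow policies can be illustrated with a small Python model.
    It is a simplification under stated assumptions: the RingBuffer class is
    hypothetical and works one event at a time, whereas the real ring buffer
    is organised in pages.

```python
from collections import deque


class RingBuffer:
    """Toy buffer with a fixed capacity and a selectable overflow policy."""

    def __init__(self, capacity, overwrite=True):
        self.capacity = capacity
        self.overwrite = overwrite
        self.events = deque()
        self.dropped = 0  # events lost to either policy

    def write(self, event):
        """Store an event; return False if the new event was discarded."""
        if len(self.events) < self.capacity:
            self.events.append(event)
            return True
        self.dropped += 1
        if self.overwrite:
            self.events.popleft()  # lose the oldest event
            self.events.append(event)
            return True
        return False  # buffer full: lose the newest event
```

    With overwrite enabled, the buffer ends up holding the most recent
    events. With it disabled, the buffer keeps the first events written after
    tracing started, which is the snapshot use case the commit mentions.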
     

18 Feb, 2011

1 commit


09 Feb, 2011

1 commit

  • The "Delta way too big" warning might appear on a system with an
    unstable sched clock right after the system is resumed, if tracing
    was enabled during the suspend.

    Since it's not really a bug, and the unstable sched clock is otherwise
    fast and reliable, Steven suggested to keep using the sched clock in
    any case and just to make a note of it in the warning itself.

    Signed-off-by: Jiri Olsa
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Jiri Olsa
     

19 Jan, 2011

1 commit

  • Fix a bunch of
    warning: ‘inline’ is not at beginning of declaration
    messages when building a 'make allyesconfig' kernel with -Wextra.

    These warnings are trivial to kill, yet rather annoying when building with
    -Wextra.
    The more we can cut down on pointless crap like this the better (IMHO).

    A previous patch to do this for a 'allnoconfig' build has already been
    merged. This just takes the cleanup a little further.

    Signed-off-by: Jesper Juhl
    Signed-off-by: Jiri Kosina

    Jesper Juhl
     

24 Dec, 2010

1 commit

  • Fix two related problems in the event-copying loop of
    ring_buffer_read_page.

    The loop condition for copying events is off-by-one.
    "len" is the remaining space in the caller-supplied page.
    "size" is the size of the next event (or two events).
    If len == size, then there is just enough space for the next event.

    size was set to rb_event_ts_length, which may include the size of two
    events if the first event is a time-extend, in order to assure time-
    extends are kept together with the event after it. However,
    rb_advance_reader always advances by one event. This would result in the
    event after any time-extend being duplicated. Instead, get the size of
    a single event for the memcpy, but use rb_event_ts_length for the loop
    condition.

    Signed-off-by: David Sharp
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    David Sharp
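
    The corrected loop can be modelled in Python as follows. This is a sketch
    of the logic described in this commit message, not the kernel code: events
    are hypothetical (kind, size) tuples, and event_ts_length and read_page
    are illustrative names.

```python
def event_ts_length(events, i):
    """Size of event i, plus the following event if i is a time extend."""
    kind, size = events[i]
    if kind == "time_extend" and i + 1 < len(events):
        return size + events[i + 1][1]
    return size


def read_page(events, page_len):
    """Copy events into a page of page_len bytes; return the copied events."""
    copied = []
    remaining = page_len
    i = 0
    while i < len(events):
        # Loop condition: use the combined length so a time extend is
        # never copied without the event after it. Equal sizes still fit.
        if remaining < event_ts_length(events, i):
            break
        # Copy exactly one event, because the reader advances by one event.
        copied.append(events[i])
        remaining -= events[i][1]
        i += 1
    return copied
```

    For example, with a 16-byte data event followed by an 8-byte time extend
    and another 16-byte data event, a 40-byte page takes all three events
    exactly once, while a 32-byte page stops after the first event rather
    than splitting the time extend from its data.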
     

28 Oct, 2010

1 commit

  • …/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (50 commits)
    perf python scripting: Add futex-contention script
    perf python scripting: Fixup cut'n'paste error in sctop script
    perf scripting: Shut up 'perf record' final status
    perf record: Remove newline character from perror() argument
    perf python scripting: Support fedora 11 (audit 1.7.17)
    perf python scripting: Improve the syscalls-by-pid script
    perf python scripting: print the syscall name on sctop
    perf python scripting: Improve the syscalls-counts script
    perf python scripting: Improve the failed-syscalls-by-pid script
    kprobes: Remove redundant text_mutex lock in optimize
    x86/oprofile: Fix uninitialized variable use in debug printk
    tracing: Fix 'faild' -> 'failed' typo
    perf probe: Fix format specified for Dwarf_Off parameter
    perf trace: Fix detection of script extension
    perf trace: Use $PERF_EXEC_PATH in canned report scripts
    perf tools: Document event modifiers
    perf tools: Remove direct slang.h include
    perf_events: Fix for transaction recovery in group_sched_in()
    perf_events: Revert: Fix transaction recovery in group_sched_in()
    perf, x86: Use NUMA aware allocations for PEBS/BTS/DS allocations
    ...

    Linus Torvalds
     

26 Oct, 2010

1 commit


23 Oct, 2010

1 commit

  • * 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
    vfs: make no_llseek the default
    vfs: don't use BKL in default_llseek
    llseek: automatically add .llseek fop
    libfs: use generic_file_llseek for simple_attr
    mac80211: disallow seeks in minstrel debug code
    lirc: make chardev nonseekable
    viotape: use noop_llseek
    raw: use explicit llseek file operations
    ibmasmfs: use generic_file_llseek
    spufs: use llseek in all file operations
    arm/omap: use generic_file_llseek in iommu_debug
    lkdtm: use generic_file_llseek in debugfs
    net/wireless: use generic_file_llseek in debugfs
    drm: use noop_llseek

    Linus Torvalds
     

22 Oct, 2010

1 commit

  • …git/tip/linux-2.6-tip

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (163 commits)
    tracing: Fix compile issue for trace_sched_wakeup.c
    [S390] hardirq: remove pointless header file includes
    [IA64] Move local_softirq_pending() definition
    perf, powerpc: Fix power_pmu_event_init to not use event->ctx
    ftrace: Remove recursion between recordmcount and scripts/mod/empty
    jump_label: Add COND_STMT(), reducer wrappery
    perf: Optimize sw events
    perf: Use jump_labels to optimize the scheduler hooks
    jump_label: Add atomic_t interface
    jump_label: Use more consistent naming
    perf, hw_breakpoint: Fix crash in hw_breakpoint creation
    perf: Find task before event alloc
    perf: Fix task refcount bugs
    perf: Fix group moving
    irq_work: Add generic hardirq context callbacks
    perf_events: Fix transaction recovery in group_sched_in()
    perf_events: Fix bogus AMD64 generic TLB events
    perf_events: Fix bogus context time tracking
    tracing: Remove parent recording in latency tracer graph options
    tracing: Use one prologue for the preempt irqs off tracer function tracers
    ...

    Linus Torvalds
     

21 Oct, 2010

5 commits

  • With the binding of time extends to events we no longer need to use
    the macro RB_TIMESTAMPS_PER_PAGE. Remove it.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • By using inline and noinline, we are able to make the fast path of
    recording an event 4% faster.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • There's a condition to check if we should add a time extend or
    not in the fast path. But this condition is racy (in the sense
    that we can add an unnecessary time extend, but nothing that
    can break anything). We later check if the time or event time
    delta should be zero or have real data in it (not racy), making
    this first check redundant.

    This check may help save space once in a while, but really is
    not worth the hassle to try to save some space that happens at
    most 134 ms at a time.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • When the time between two timestamps is greater than
    2^27 nanosecs (~134 ms) a time extend event is added that extends
    the time difference to 59 bits (~18 years). This is due to
    events only having a 27 bit field to store time.

    Currently this time extend is a separate event. We add it just before
    the event data that is being written to the buffer. But before
    the event data is committed, the event data can also be discarded (as
    with the case of filters). But because the time extend has already been
    committed, it will stay in the buffer.

    If lots of events are being filtered and no event is being
    written, then every 134ms a time extend can be added to the buffer
    without any data attached. To keep from filling the entire buffer
    with time extends, a time extend will never be the first event
    in a page because the page timestamp can be used. Time extends can
    only fill the rest of a page with some data at the beginning.

    This patch binds the time extend with the data. The difference here
    is that the time extend is not committed before the data is added.
    Instead, when a time extend is needed, the space reserved on
    the ring buffer is the time extend + the data event size. The
    time extend is added to the first part of the reserved block and
    the data is added to the second. The time extend event is passed
    back to the reserver, but since the reserver also uses a function
    to find the data portion of the reserved block, no changes to the
    ring buffer interface need to be made.

    When a commit is discarded, we now remove both the time extend and
    the event. With this approach no more than one time extend can
    be in the buffer in a row. Data must always follow a time extend.

    Thanks to Mathieu Desnoyers for suggesting this idea.

    Suggested-by: Mathieu Desnoyers
    Cc: Thomas Gleixner
    Signed-off-by: Steven Rostedt

    Steven Rostedt
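
    The binding of a time extend to its data can be sketched in Python. This
    is a model under assumptions, not the kernel implementation: the record
    tuples, the reserve function, and the Buffer class are hypothetical, a
    time extend is used whenever the delta does not fit in 27 bits, and the
    data event in that case carries a zero delta.

```python
DELTA_BITS = 27   # per-event delta field, from the commit message
EXTEND_BITS = 59  # time extend field, from the commit message
MAX_DELTA = (1 << DELTA_BITS) - 1


def reserve(delta_ns, payload):
    """Return the records reserved as one block for this event."""
    if delta_ns > MAX_DELTA:
        if delta_ns >= (1 << EXTEND_BITS):
            raise ValueError("delta does not fit in a time extend")
        # The time extend takes the first part of the block, the data
        # takes the second part.
        return [("time_extend", delta_ns), ("data", 0, payload)]
    return [("data", delta_ns, payload)]


class Buffer:
    def __init__(self):
        self.records = []

    def write(self, delta_ns, payload, keep=True):
        """Reserve a block, then either commit it or discard it whole."""
        block = reserve(delta_ns, payload)
        if keep:
            self.records.extend(block)
        return block


ms_per_delta = (1 << DELTA_BITS) / 1e6
years_per_extend = (1 << EXTEND_BITS) / 1e9 / (365.25 * 24 * 3600)
print(ms_per_delta, years_per_extend)
```

    Because a discarded write drops the whole block, a filtered event leaves
    no orphaned time extend behind. The last two lines check the figures
    quoted in the message: 2^27 ns is about 134 ms and 2^59 ns is about
    18 years.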
     
  • The delta between events is passed to the timestamp code by reference
    and the timestamp code will reset the value. But it can be reset
    from the caller. No need to pass it in by reference.

    Changing the call to pass by value lets gcc optimize the code
    a bit more, since it can store the delta in a register and not
    worry about updating the reference.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

20 Oct, 2010

2 commits

  • The original code for the ring buffer had locations that modified
    the timestamp and that change was used by the callers. Now,
    the timestamp is not reused by the callers and there is no reason
    to pass it by reference.

    Changing the call to pass by value lets gcc optimize the code
    a bit more, since it can store the timestamp in a register and not
    worry about updating the reference.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Gcc inlines the slow path of the ring buffer write which can
    hurt performance. This patch simply forces the slow path function
    rb_move_tail() to always be a function.

    The ring_buffer_benchmark module with reader_disabled=1 shows that
    this patch changes the time to record an event from 135 ns to
    132 ns. (3 ns or 2.22% improvement)

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

15 Oct, 2010

1 commit

  • All file_operations should get a .llseek operation so we can make
    nonseekable_open the default for future file operations without a
    .llseek pointer.

    The three cases that we can automatically detect are no_llseek, seq_lseek
    and default_llseek. For cases where we can automatically prove that
    the file offset is always ignored, we use noop_llseek, which maintains
    the current behavior of not returning an error from a seek.

    New drivers should normally not use noop_llseek but instead use no_llseek
    and call nonseekable_open at open time. Existing drivers can be converted
    to do the same when the maintainer knows for certain that no user code
    relies on calling seek on the device file.

    The generated code is often incorrectly indented and right now contains
    comments that clarify for each added line why a specific variant was
    chosen. In the version that gets submitted upstream, the comments will
    be gone and I will manually fix the indentation, because there does not
    seem to be a way to do that using coccinelle.

    Some amount of new code is currently sitting in linux-next that should get
    the same modifications, which I will do at the end of the merge window.

    Many thanks to Julia Lawall for helping me learn to write a semantic
    patch that does all this.

    ===== begin semantic patch =====
    // This adds an llseek= method to all file operations,
    // as a preparation for making no_llseek the default.
    //
    // The rules are
    // - use no_llseek explicitly if we do nonseekable_open
    // - use seq_lseek for sequential files
    // - use default_llseek if we know we access f_pos
    // - use noop_llseek if we know we don't access f_pos,
    // but we still want to allow users to call lseek
    //
    @ open1 exists @
    identifier nested_open;
    @@
    nested_open(...)
    {

    }

    @ open exists@
    identifier open_f;
    identifier i, f;
    identifier open1.nested_open;
    @@
    int open_f(struct inode *i, struct file *f)
    {

    }

    @ read disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {

    }

    @ read_no_fpos disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ write @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {

    }

    @ write_no_fpos @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ fops0 @
    identifier fops;
    @@
    struct file_operations fops = {
    ...
    };

    @ has_llseek depends on fops0 @
    identifier fops0.fops;
    identifier llseek_f;
    @@
    struct file_operations fops = {
    ...
    .llseek = llseek_f,
    ...
    };

    @ has_read depends on fops0 @
    identifier fops0.fops;
    identifier read_f;
    @@
    struct file_operations fops = {
    ...
    .read = read_f,
    ...
    };

    @ has_write depends on fops0 @
    identifier fops0.fops;
    identifier write_f;
    @@
    struct file_operations fops = {
    ...
    .write = write_f,
    ...
    };

    @ has_open depends on fops0 @
    identifier fops0.fops;
    identifier open_f;
    @@
    struct file_operations fops = {
    ...
    .open = open_f,
    ...
    };

    // use no_llseek if we call nonseekable_open
    ////////////////////////////////////////////
    @ nonseekable1 depends on !has_llseek && has_open @
    identifier fops0.fops;
    identifier nso ~= "nonseekable_open";
    @@
    struct file_operations fops = {
    ... .open = nso, ...
    +.llseek = no_llseek, /* nonseekable */
    };

    @ nonseekable2 depends on !has_llseek @
    identifier fops0.fops;
    identifier open.open_f;
    @@
    struct file_operations fops = {
    ... .open = open_f, ...
    +.llseek = no_llseek, /* open uses nonseekable */
    };

    // use seq_lseek for sequential files
    /////////////////////////////////////
    @ seq depends on !has_llseek @
    identifier fops0.fops;
    identifier sr ~= "seq_read";
    @@
    struct file_operations fops = {
    ... .read = sr, ...
    +.llseek = seq_lseek, /* we have seq_read */
    };

    // use default_llseek if there is a readdir
    ///////////////////////////////////////////
    @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier readdir_e;
    @@
    // any other fop is used that changes pos
    struct file_operations fops = {
    ... .readdir = readdir_e, ...
    +.llseek = default_llseek, /* readdir is present */
    };

    // use default_llseek if at least one of read/write touches f_pos
    /////////////////////////////////////////////////////////////////
    @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read.read_f;
    @@
    // read fops use offset
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = default_llseek, /* read accesses f_pos */
    };

    @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ... .write = write_f, ...
    + .llseek = default_llseek, /* write accesses f_pos */
    };

    // Use noop_llseek if neither read nor write accesses f_pos
    ///////////////////////////////////////////////////////////

    @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    identifier write_no_fpos.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ...
    .write = write_f,
    .read = read_f,
    ...
    +.llseek = noop_llseek, /* read and write both use no f_pos */
    };

    @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write_no_fpos.write_f;
    @@
    struct file_operations fops = {
    ... .write = write_f, ...
    +.llseek = noop_llseek, /* write uses no f_pos */
    };

    @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    @@
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = noop_llseek, /* read uses no f_pos */
    };

    @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    @@
    struct file_operations fops = {
    ...
    +.llseek = noop_llseek, /* no read or write fn */
    };
    ===== End semantic patch =====

    Signed-off-by: Arnd Bergmann
    Cc: Julia Lawall
    Cc: Christoph Hellwig

    Arnd Bergmann
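
    The priority order of the rules in the semantic patch can be summarised
    as a small decision function. This Python sketch is only a reading aid:
    the dictionary keys (open_is_nonseekable, read_uses_pos, write_uses_pos,
    and so on) are hypothetical flags standing in for what the Coccinelle
    rules match in the C source, and matching on seq_read is simplified to an
    exact name comparison.

```python
def choose_llseek(fops):
    """Pick the .llseek value for a modelled file_operations dict."""
    if fops.get("llseek"):
        return fops["llseek"]      # already set: the patch leaves it alone
    if fops.get("open_is_nonseekable"):
        return "no_llseek"         # open is or calls nonseekable_open
    if fops.get("read") == "seq_read":
        return "seq_lseek"         # sequential file
    if fops.get("readdir"):
        return "default_llseek"    # readdir is present
    if fops.get("read_uses_pos") or fops.get("write_uses_pos"):
        return "default_llseek"    # read or write accesses f_pos
    return "noop_llseek"           # nothing looks at the file offset


examples = [
    {"open_is_nonseekable": True, "read": "my_read"},
    {"read": "seq_read"},
    {"read": "my_read", "read_uses_pos": True},
    {"write": "my_write"},
]
for fops in examples:
    print(choose_llseek(fops))
```

    The order matters: an earlier rule wins over a later one, which mirrors
    the "depends on" clauses that exclude structs already handled by a
    previous rule.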
     

13 Oct, 2010

1 commit

  • Time stamps for the ring buffer are created by the difference between
    two events. Each page of the ring buffer holds a full 64 bit timestamp.
    Each event has a 27 bit delta stamp from the last event. The unit of time
    is nanoseconds, so 27 bits can hold ~134 milliseconds. If two events
    happen more than 134 milliseconds apart, a time extend is inserted
    to add more bits for the delta. The time extend has 59 bits, which
    is good for ~18 years.

    Currently the time extend is committed separately from the event.
    If an event is discarded before it is committed, due to filtering,
    the time extend still exists. If all events are being filtered, then
    after ~134 milliseconds a new time extend will be added to the buffer.

    This can only happen till the end of the page. Since each page holds
    a full timestamp, there is no reason to add a time extend to the
    beginning of a page. Time extends can only fill a page that has actual
    data at the beginning, so there is no fear that time extends will fill
    more than a page without any data.

    When reading an event, a loop is made to skip over time extends
    since they are only used to maintain the time stamp and are never
    given to the caller. As a paranoid check to prevent the loop running
    forever, with the knowledge that time extends may only fill a page,
    a check is made that tests the iteration of the loop, and if the
    iteration is more than the number of time extends that can fit in a page
    a warning is printed and the ring buffer is disabled (all of ftrace
    is also disabled with it).

    There is another event type that is called a TIMESTAMP which can
    hold 64 bits of data in the theoretical case that two events happen
    18 years apart. This code has not been implemented, but the name
    of this event exists, as well as the structure for it. The
    size of a TIMESTAMP is 16 bytes, whereas a time extend is only
    8 bytes. The macro used to calculate how many time extends can fit on
    a page used the TIMESTAMP size instead of the time extend size,
    cutting the amount in half.

    The following test case can easily trigger the warning since we only
    need to have half the page filled with time extends to trigger the
    warning:

    # cd /sys/kernel/debug/tracing/
    # echo function > current_tracer
    # echo 'common_pid < 0' > events/ftrace/function/filter
    # echo > trace
    # echo 1 > trace_marker
    # sleep 120
    # cat trace

    This enables the function tracer, sets the filter to only trace
    functions where the process id is negative (no events), clears
    the trace buffer to ensure that we have nothing in the buffer,
    writes to trace_marker to add an event to the beginning of a page,
    sleeps for 2 minutes (only 35 seconds is probably needed, but this
    guarantees the bug), and then finally reads the trace, which
    triggers the bug.

    This patch fixes the typo and prevents the false positive of that warning.

    Reported-by: Hans J. Koch
    Tested-by: Hans J. Koch
    Cc: Thomas Gleixner
    Cc: Stable Kernel
    Signed-off-by: Steven Rostedt

    Steven Rostedt
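
    The "35 seconds" estimate in this commit message can be checked with some
    quick arithmetic. The sketch below assumes a 4096-byte page and ignores
    the page header, so the real limits are slightly lower; the function and
    constant names are illustrative only.

```python
PAGE_BYTES = 4096               # assumption: page size, header ignored
TIME_EXTEND_BYTES = 8           # from the commit message
TIMESTAMP_BYTES = 16            # from the commit message
EXTEND_INTERVAL_NS = 1 << 27    # one time extend roughly every 134 ms


def loop_limit(record_bytes, page_bytes=PAGE_BYTES):
    """Number of records of the given size that fit in one page."""
    return page_bytes // record_bytes


def seconds_to_reach(limit):
    """Time for filtered tracing to accumulate `limit` time extends."""
    return limit * EXTEND_INTERVAL_NS / 1e9


buggy_limit = loop_limit(TIMESTAMP_BYTES)    # limit computed with the wrong size
fixed_limit = loop_limit(TIME_EXTEND_BYTES)  # limit computed with the right size
print(buggy_limit, seconds_to_reach(buggy_limit))
print(fixed_limit, seconds_to_reach(fixed_limit))
```

    With the wrong size the check trips after about 256 time extends, which
    takes a little over 34 seconds to accumulate, even though the page can
    still legitimately hold about twice as many.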
     

15 Sep, 2010

1 commit


05 Sep, 2010

1 commit


02 Sep, 2010

1 commit

  • While discussing the strictness of the 80 character limit on the
    Kernel Summit Discussion mailing list, I showed examples where I
    broke that limit slightly with some algorithms. While discussing
    with John Linville which looked better, I realized that two of the
    80 char breaking culprits were an identical expression.

    As a clean up, this patch moves the identical expression into its
    own helper function and that is used instead. As a side effect,
    the offending code is now under the 80 character limit. :-)

    This clean up code also changes the expression from

    (A - B) - C to A - (B + C)

    This makes the code look a little nicer too.
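
    The regrouping is safe for unsigned arithmetic, because both forms
    wrap the same way. A minimal sketch, assuming 32-bit unsigned
    operands; remaining_old() and remaining_new() are hypothetical names
    and not the helper added by the patch:

```c
#include <stdint.h>

/* Original grouping: (A - B) - C. */
static uint32_t remaining_old(uint32_t a, uint32_t b, uint32_t c)
{
	return (a - b) - c;
}

/* Regrouped form: A - (B + C). Equal to the above modulo 2^32. */
static uint32_t remaining_new(uint32_t a, uint32_t b, uint32_t c)
{
	return a - (b + c);
}
```

    Sharing one such helper between both call sites is what shortens the
    lines; the regrouping itself does not change the result.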

    Cc: John W. Linville
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

16 Aug, 2010

1 commit


07 Aug, 2010

1 commit

  • With the configuration: CONFIG_DEBUG_PAGEALLOC=y and Shaohua's patch:

    [PATCH]x86: make spurious_fault check correct pte bit

    Function call graph trace with the following will trigger a page fault.

    # cd /sys/kernel/debug/tracing/
    # echo function_graph > current_tracer
    # cat per_cpu/cpu1/trace_pipe_raw > /dev/null

    BUG: unable to handle kernel paging request at ffff880006e99000
    IP: [] rb_event_length+0x1/0x3f
    PGD 1b19063 PUD 1b1d063 PMD 3f067 PTE 6e99160
    Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
    last sysfs file: /sys/devices/virtual/net/lo/operstate
    CPU 1
    Modules linked in:

    Pid: 1982, comm: cat Not tainted 2.6.35-rc6-aes+ #300 /Bochs
    RIP: 0010:[] [] rb_event_length+0x1/0x3f
    RSP: 0018:ffff880006475e38 EFLAGS: 00010006
    RAX: 0000000000000ff0 RBX: ffff88000786c630 RCX: 000000000000001d
    RDX: ffff880006e98000 RSI: 0000000000000ff0 RDI: ffff880006e99000
    RBP: ffff880006475eb8 R08: 000000145d7008bd R09: 0000000000000000
    R10: 0000000000008000 R11: ffffffff815d9336 R12: ffff880006d08000
    R13: ffff880006e605d8 R14: 0000000000000000 R15: 0000000000000018
    FS: 00007f2b83e456f0(0000) GS:ffff880002100000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffff880006e99000 CR3: 00000000064a8000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process cat (pid: 1982, threadinfo ffff880006474000, task ffff880006e40770)
    Stack:
    ffff880006475eb8 ffffffff8108730f 0000000000000ff0 000000145d7008bd
    ffff880006e98010 ffff880006d08010 0000000000000296 ffff88000786c640
    ffffffff81002956 0000000000000000 ffff8800071f4680 ffff8800071f4680
    Call Trace:
    [] ? ring_buffer_read_page+0x15a/0x24a
    [] ? return_to_handler+0x15/0x2f
    [] tracing_buffers_read+0xb9/0x164
    [] vfs_read+0xaf/0x150
    [] return_to_handler+0x0/0x2f
    [] __bad_area_nosemaphore+0x17e/0x1a1
    [] return_to_handler+0x0/0x2f
    [] bad_area_nosemaphore+0x13/0x15
    Code: 80 25 b2 16 b3 00 fe c9 c3 55 48 89 e5 f0 80 0d a4 16 b3 00 02 c9 c3 55 31 c0 48 89 e5 48 83 3d 94 16 b3 00 01 c9 0f 94 c0 c3 55 0f 48 89 e5 83 e1 1f b8 08 00 00 00 0f b6 d1 83 fa 1e 74 27
    RIP [] rb_event_length+0x1/0x3f
    RSP
    CR2: ffff880006e99000
    ---[ end trace a6877bb92ccb36bb ]---

    The root cause is that ring_buffer_read_page() may read beyond the page
    boundary, because the boundary check is done after reading. This is
    fixed by doing the boundary check before reading.
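
    The check-before-read ordering can be sketched in userspace as
    follows. The event layout (a single length byte covering the whole
    event) and the function count_events() are assumptions made for this
    example; they are not the ring buffer's real format or API.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical event layout, for illustration only: one length byte
 * giving the total size of the event (header included), then payload.
 * Returns the number of complete events found in buf[0..size).
 */
static size_t count_events(const uint8_t *buf, size_t size)
{
	size_t pos = 0;
	size_t n = 0;

	while (pos < size) {            /* check the boundary first ... */
		uint8_t len = buf[pos];     /* ... and only then read the header */

		if (len == 0 || len > size - pos)
			break;                  /* malformed or truncated event */
		pos += len;
		n++;
	}
	return n;
}
```

    If the header were read before comparing pos against size, the last
    iteration would touch the byte just past the buffer, which is the
    kind of access that faults when the next page is unmapped.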

    Reported-by: Shaohua Li
    Cc:
    Signed-off-by: Huang Ying
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Huang Ying
     

21 Jul, 2010

1 commit


09 Jun, 2010

1 commit


04 Jun, 2010

1 commit

  • The ftrace_preempt_disable/enable functions were meant to address a
    recursive race caused by the function tracer. The function tracer
    traces all functions, which makes it easily susceptible to recursion.
    One area was preempt_enable(). This would call the scheduler and
    the scheduler would call the function tracer and loop.
    (Or so it was thought.)

    The ftrace_preempt_disable/enable pair was made to protect against
    recursion inside the scheduler by storing the NEED_RESCHED flag. If it
    was set before ftrace_preempt_disable(), ftrace_preempt_enable() would
    not call schedule, on the assumption that a flag set beforehand meant
    the task had already scheduled, unless it was already in the scheduler.

    This worked fine except in the case of SMP, where another task would set
    the NEED_RESCHED flag for a task on another CPU, and then kick off an
    IPI to trigger it. This could cause the NEED_RESCHED to be saved at
    ftrace_preempt_disable() but the IPI to arrive in the preempt
    disabled section. The ftrace_preempt_enable() would not call the scheduler
    because the flag was already set before entering the section.

    This bug would cause a missed preemption check and cause higher latencies.

    Investigating further, I found that the recursion caused by the function
    tracer was not due to schedule(), but due to preempt_schedule(). Now
    that preempt_schedule is completely annotated with notrace, the recursion
    is no longer an issue.
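
    The missed preemption check can be replayed with a small
    single-threaded model. This is only a sketch of the logic described
    above: struct model_cpu and the model_* functions are hypothetical
    stand-ins, not kernel code, and the "IPI" is just a field update.

```c
#include <stdbool.h>

/* Userspace model; every name here is a hypothetical stand-in. */
struct model_cpu {
	bool need_resched;    /* models the NEED_RESCHED flag */
	bool ipi_pending;     /* reschedule IPI sent but not yet handled */
	int  schedule_calls;  /* times the model scheduler was entered */
};

/* Save the flag on entry, as the removed helpers are described to do. */
static bool model_preempt_disable(const struct model_cpu *cpu)
{
	return cpu->need_resched;
}

/* Skip the reschedule check if the flag was already set on entry. */
static void model_preempt_enable(struct model_cpu *cpu, bool saved)
{
	if (saved)
		return;  /* assumed "already handled": this is the bug */
	if (cpu->need_resched) {
		cpu->schedule_calls++;
		cpu->need_resched = false;
	}
}

/* Replay the SMP race: flag set remotely, IPI lands inside the section. */
static int model_race(void)
{
	struct model_cpu cpu = { .need_resched = true, .ipi_pending = true };
	bool saved = model_preempt_disable(&cpu);

	/* The IPI arrives while preemption is disabled, so the flag stays set. */
	cpu.ipi_pending = false;
	model_preempt_enable(&cpu, saved);
	return cpu.schedule_calls;
}
```

    In this model, model_race() returns with the scheduler never entered
    even though the flag is still set, which corresponds to the missed
    preemption check.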

    Reported-by: Thomas Gleixner
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

25 May, 2010

2 commits

  • Currently the trace splice code zeros out the excess bytes in the page before
    sending it off to userspace.

    This is to make sure userspace is not getting anything it should not be
    when reading the pages, because the excess data was never initialized
    to zero before writing (for performance reasons).

    But the splice code has no business doing this work; it should be
    done by the ring buffer. With the latest changes for recording lost
    events, the splice code gets it wrong anyway.

    Move the zeroing out of excess bytes into the ring buffer code.
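
    Zeroing the excess bytes amounts to clearing everything past the
    committed data before the page is handed out. A minimal sketch,
    assuming a flat byte array for the page; zero_page_tail() and its
    parameters are hypothetical and not the ring buffer's interface.

```c
#include <stddef.h>
#include <string.h>

/*
 * Zero everything in page[] past the committed data so stale bytes
 * never reach the reader. Returns the number of bytes cleared.
 */
static size_t zero_page_tail(unsigned char *page, size_t page_size,
			     size_t commit)
{
	if (commit >= page_size)
		return 0;  /* page is full; nothing to clear */
	memset(page + commit, 0, page_size - commit);
	return page_size - commit;
}
```

    Doing this in the code that knows where the committed data really
    ends avoids having each consumer guess that offset on its own.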

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • The code to store the "lost events" requires knowing the real end
    of the page. Since the 'commit' includes the padding at the end of
    a page, a "real_end" variable was used to keep track of the end, not
    including the padding.

    If events were lost, the reader can place the count of events in
    the padded area if there is enough room.

    The bug this patch fixes is that when we fill the page we do not
    reset the real_end variable, and if the writer had wrapped a few
    times, the real_end would be incorrect.

    This patch simply resets the real_end if the page was filled.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

05 May, 2010

1 commit