21 Dec, 2011

1 commit

  • There are four places where a new filter for a given filter string is
    created, each involving several different steps. This patch factors
    those steps into create_[system_]filter() functions, which in turn use
    create_filter_{start|finish}() for the common parts.

    The only functional change is that if replace_filter_string() is
    requested and fails, creation now fails without any side effect
    instead of the failure being silently ignored.

    Note that the system filter is now installed after the processing is
    complete, which makes freeing it beforehand and then restoring the
    filter string on error unnecessary.

    -v2: Rebased to resolve conflict with 49aa29513e and updated both
    create_filter() functions to always set *filterp instead of
    requiring the caller to clear it to %NULL on entry.
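
    As a rough sketch of the factored shape (parameter names here follow
    the description above and are illustrative rather than quoted from
    the patch):

    /* sketch: parse filter_str for 'call' and hand back the result */
    static int create_filter(struct ftrace_event_call *call,
                             char *filter_str, bool set_str,
                             struct event_filter **filterp)
    {
            struct event_filter *filter = NULL;
            struct filter_parse_state *ps;
            int err;

            err = create_filter_start(filter_str, set_str, &ps, &filter);
            if (!err)
                    err = replace_preds(call, filter, ps, filter_str, false);
            create_filter_finish(ps);

            /* v2 behavior: *filterp is always set, even on error */
            *filterp = filter;
            return err;
    }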

    Link: http://lkml.kernel.org/r/1323988305-1469-2-git-send-email-tj@kernel.org

    Signed-off-by: Tejun Heo
    Signed-off-by: Steven Rostedt

    Tejun Heo
     

06 Dec, 2011

2 commits

  • Merge reason: Add these cherry-picked commits so that future changes
    on perf/core don't conflict.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Though not all events have a 'prev_pid' field, it was allowed to do this:

    # echo 'prev_pid == 100' > events/sched/filter

    but commit 75b8e98263fdb0bfbdeba60d4db463259f1fe8a2 (tracing/filter: Swap
    entire filter of events) broke it for no apparent reason.

    Link: http://lkml.kernel.org/r/4EAF46CF.8040408@cn.fujitsu.com

    Signed-off-by: Li Zefan
    Signed-off-by: Steven Rostedt

    Li Zefan
     

02 Dec, 2011

1 commit

  • ftrace_event_call->filter is sched RCU protected but didn't use
    rcu_assign_pointer(). Use it.

    TODO: Add proper __rcu annotation to call->filter and all its users.

    -v2: Use RCU_INIT_POINTER() for %NULL clearing as suggested by Eric.
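
    In code terms this is the standard publish pattern (a minimal sketch,
    not the literal diff):

    /* publish a fully initialized filter; pairs with the RCU
     * dereference on the read side */
    rcu_assign_pointer(call->filter, filter);

    /* clearing to NULL needs no memory barrier */
    RCU_INIT_POINTER(call->filter, NULL);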

    Link: http://lkml.kernel.org/r/20111123164949.GA29639@google.com

    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: stable@kernel.org # (2.6.39+)
    Signed-off-by: Tejun Heo
    Signed-off-by: Steven Rostedt

    Tejun Heo
     

05 Nov, 2011

1 commit

  • The system filter can be used to set multiple event filters that
    exist within the system. But currently it displays the last filter
    written, which does not necessarily correspond to the filters within
    the system. The system filter itself is not used to filter any events;
    it is just a means to set the filters of the events within it.

    Because this causes an ambiguous state when the system filter reads
    a filter string but the events within the system have different strings,
    it is best to just show a boilerplate:

    ### global filter ###
    # Use this to set filters for multiple events.
    # Only events with the given fields will be affected.
    # If no events are modified, an error message will be displayed here.

    If an error occurs while writing to the system filter, the system
    filter will replace the boilerplate with the error message, as it
    currently does.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

03 Nov, 2011

1 commit

  • Though not all events have a 'prev_pid' field, it was allowed to do this:

    # echo 'prev_pid == 100' > events/sched/filter

    but commit 75b8e98263fdb0bfbdeba60d4db463259f1fe8a2 (tracing/filter: Swap
    entire filter of events) broke it for no apparent reason.

    Link: http://lkml.kernel.org/r/4EAF46CF.8040408@cn.fujitsu.com

    Signed-off-by: Li Zefan
    Signed-off-by: Steven Rostedt

    Li Zefan
     

31 Aug, 2011

1 commit


20 Aug, 2011

10 commits

  • Adding automated tests that run as a late_initcall. The tests are
    compiled in with the CONFIG_FTRACE_STARTUP_TEST option.

    Adding a test event "ftrace_test_filter" used to simulate
    filter processing during event occurrence.

    String filters are compiled and tested against several
    test events with different values.

    Also testing that evaluation of explicit predicates is omitted
    due to the lazy filter evaluation.
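
    Each test case pairs a filter string with an expected outcome,
    roughly along these lines (the struct and field names here are
    illustrative, not copied from the patch):

    static struct test_filter_data {
            char *filter;      /* filter string to compile and run     */
            int match;         /* expected filter_match_preds() result
                                  for the simulated record              */
            char *not_visited; /* preds lazy evaluation should skip    */
    } test_filter_data[] = {
            { .filter = "prev_pid == 1",  .match = 1, .not_visited = "" },
    };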

    Signed-off-by: Jiri Olsa
    Link: http://lkml.kernel.org/r/1313072754-4620-11-git-send-email-jolsa@redhat.com
    Signed-off-by: Steven Rostedt

    Jiri Olsa
     
  • Changing filter_match_preds function to use unified predicates tree
    processing.

    Signed-off-by: Jiri Olsa
    Link: http://lkml.kernel.org/r/1313072754-4620-10-git-send-email-jolsa@redhat.com
    Signed-off-by: Steven Rostedt

    Jiri Olsa
     
  • Changing fold_pred_tree function to use unified predicates tree
    processing.

    Signed-off-by: Jiri Olsa
    Link: http://lkml.kernel.org/r/1313072754-4620-9-git-send-email-jolsa@redhat.com
    Signed-off-by: Steven Rostedt

    Jiri Olsa
     
  • Changing fold_pred function to use unified predicates tree
    processing.

    Signed-off-by: Jiri Olsa
    Link: http://lkml.kernel.org/r/1313072754-4620-8-git-send-email-jolsa@redhat.com
    Signed-off-by: Steven Rostedt

    Jiri Olsa
     
  • Changing count_leafs function to use unified predicates tree
    processing.

    Signed-off-by: Jiri Olsa
    Link: http://lkml.kernel.org/r/1313072754-4620-7-git-send-email-jolsa@redhat.com
    Signed-off-by: Steven Rostedt

    Jiri Olsa
     
  • Adding walk_pred_tree function to be used for walking through
    the filter predicates.

    For each predicate the callback function is called, allowing
    users to add their own functionality or customize their way
    through the filter predicates.

    Changing check_pred_tree function to use walk_pred_tree.
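
    The walker's interface is roughly (a sketch reconstructed from the
    description; treat the exact signature as an assumption):

    typedef int (*filter_pred_walkcb_t)(enum move_type move,
                                        struct filter_pred *pred,
                                        int *err, void *data);

    static int walk_pred_tree(struct filter_pred *preds,
                              struct filter_pred *root,
                              filter_pred_walkcb_t cb, void *data);

    check_pred_tree() then becomes a thin caller: its per-predicate
    checks move into a callback, and its state travels in 'data'.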

    Signed-off-by: Jiri Olsa
    Link: http://lkml.kernel.org/r/1313072754-4620-6-git-send-email-jolsa@redhat.com
    Signed-off-by: Steven Rostedt

    Jiri Olsa
     
  • We don't need to perform a lookup through the ftrace_events list;
    instead we can use the 'tp_event' field.

    Each perf_event contains the tracepoint event field 'tp_event', which
    was initialized during tracepoint event initialization.
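
    The shortcut is essentially (sketch):

    /* before: search ftrace_events for the matching event id */
    /* after: the perf event already points at its tracepoint event */
    struct ftrace_event_call *call = p_event->tp_event;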

    Signed-off-by: Jiri Olsa
    Link: http://lkml.kernel.org/r/1313072754-4620-5-git-send-email-jolsa@redhat.com
    Signed-off-by: Steven Rostedt

    Jiri Olsa
     
  • The field_name was used just for finding the event's fields. Removing
    it means we no longer need to care about field_name allocation/freeing.

    Signed-off-by: Jiri Olsa
    Link: http://lkml.kernel.org/r/1313072754-4620-4-git-send-email-jolsa@redhat.com
    Signed-off-by: Steven Rostedt

    Jiri Olsa
     
  • Making the code cleaner by having one function to fully prepare
    the predicate (create_pred), and another to add the predicate to
    the filter (filter_add_pred).

    As a benefit, this way the dry_run flag stays only inside the
    replace_preds function and is not passed deeper.
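
    The replace_preds() loop then reads roughly like this (a sketch of
    the split described above, not the literal diff):

    pred = create_pred(ps, call, elt->op);  /* fully prepared predicate */
    if (!pred) {
            err = -EINVAL;
            goto fail;
    }

    if (!dry_run) {
            err = filter_add_pred(ps, filter, pred, &stack);
            if (err)
                    goto fail;
    }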

    Signed-off-by: Jiri Olsa
    Link: http://lkml.kernel.org/r/1313072754-4620-3-git-send-email-jolsa@redhat.com
    Signed-off-by: Steven Rostedt

    Jiri Olsa
     
  • Don't dynamically allocate the filter_pred struct; use static memory.
    This way we can get rid of the code managing the dynamic filter_pred
    struct objects.

    The create_pred function integrates the create_logical_pred function.
    This way the static predicate memory is returned from only
    one place.

    Signed-off-by: Jiri Olsa
    Link: http://lkml.kernel.org/r/1313072754-4620-2-git-send-email-jolsa@redhat.com
    Signed-off-by: Steven Rostedt

    Jiri Olsa
     

07 Jul, 2011

1 commit

  • The event system is freed when its nr_events count drops to zero. This
    happens when a module that created an event system is later removed.
    Modules may share systems, so the system is allocated when it is
    created and freed when the modules are unloaded and all the events
    under the system are removed (nr_events set to zero).

    The problem arises when a task has opened the "filter" file for the
    system. If the module is unloaded and it removed the last event for
    that system, the system structure is freed. If the task that opened
    the filter file accesses the "filter" file after the system has
    been freed, it will dereference an invalid pointer.

    By adding a ref_count, and using it to keep track of what
    is using the event system, we can free it after all users
    are finished with the event system.
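
    A minimal sketch of the refcounting (names follow the description;
    the free helper here is a hypothetical stand-in):

    struct event_subsystem {
            ...
            int nr_events;    /* events registered under the system */
            int ref_count;    /* users holding the system open      */
    };

    /* opening the system's "filter" file takes a reference ... */
    system->ref_count++;

    /* ... and both the file release and module removal paths drop one;
     * the structure is freed only on the final put */
    if (--system->ref_count == 0)
            free_system(system);   /* hypothetical helper name */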

    Reported-by: Johannes Berg
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

18 Mar, 2011

1 commit


08 Feb, 2011

12 commits

  • Because the filters are processed first and then activated
    (added to the call), we no longer need to worry about the preds
    of the filter in __alloc_preds() being used, as the filter that
    is allocating preds is not activated yet.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • When creating a new filter, instead of allocating the filter to the
    event call first and then processing the filter, it is easier to
    process a temporary filter and then just swap it with the call filter.
    Doing this simplifies the code.

    A filter is allocated and processed; when it is done, it is
    swapped with the call filter, synchronize_sched() is called to make
    sure all callers are done with the old filter (filters are called
    with preemption disabled), and then the old filter is freed.
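
    The swap is the familiar replace-then-wait sequence (sketch):

    struct event_filter *old = call->filter;

    /* publish the fully processed temporary filter */
    rcu_assign_pointer(call->filter, tmp_filter);

    /* filters run with preemption disabled, so this waits for
     * every reader still using 'old' */
    synchronize_sched();

    __free_filter(old);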

    Cc: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • There are many cases where a filter will contain multiple ORs or
    ANDs together near the leaves. Walking up and down the tree to get
    to the next compare can be a waste.

    If there are several ORs or ANDs together, fold them into a single
    pred and allocate an array of the conditions that they check.
    This will speed up the filter by linearly walking an array
    and can still break out if a short circuit condition is met.
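
    A folded pred then evaluates its children with a linear scan,
    something like this sketch ('ops' holds the indexes of the folded
    conditions and 'val' their count; the names are assumptions):

    /* all children hang off the same OR (or AND), so one comparison
     * both records the result and detects a short circuit */
    type = op->op == OP_OR;
    for (i = 0; i < op->val; i++) {
            pred = &preds[op->ops[i]];
            match = pred->fn(pred, rec);
            if (!!match == type)
                    break;  /* OR saw true / AND saw false */
    }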

    Cc: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Since the filter walks a tree to determine if a match is made or not,
    if the tree was incorrectly created, it could cause an infinite loop.

    Add a check to walk the entire tree before assigning it as a filter
    to make sure the tree is correct.

    Cc: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • The test for whether we should break out early for OR and AND
    operations can be optimized by comparing the current result with
    (pred->op == OP_OR).

    That is, if the result is true and the op is an OP_OR, or
    if the result is false and the op is not an OP_OR (thus an OP_AND),
    we can break out early in either case. Otherwise we continue
    processing.
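
    In code the whole test collapses to one comparison (sketch):

    /* a true result on an OR, or a false result on an AND,
     * settles the subtree: stop early */
    if (!!match == (pred->op == OP_OR))
            break;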

    Cc: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Currently filter_match_preds() requires a stack to push
    and pop the preds to determine whether the filter matches the record.
    This has two drawbacks:

    1) It requires a stack to store state information. As this is done
    in fast paths, we can't allocate the storage for this stack, and
    we can't use a global as it must be re-entrant. The stack is stored
    on the kernel stack, and this greatly limits how many preds we
    may allow.

    2) All conditions are calculated even when a short circuit exists.
    a || b will always calculate both a and b, even though a was already
    determined to be true.

    Using a tree we can walk a constant structure that will save
    the state as we go. The algorithm is simply:

    pred = root;
    do {
        switch (move) {
        case MOVE_DOWN:
            if (OR or AND) {
                pred = left;
                continue;
            }
            if (pred == root)
                break;
            match = pred->fn();
            pred = pred->parent;
            move = left child ? MOVE_UP_FROM_LEFT : MOVE_UP_FROM_RIGHT;
            continue;

        case MOVE_UP_FROM_LEFT:
            /* Only OR or AND can be a parent */
            if (match && OR || !match && AND) {
                /* short circuit */
                if (pred == root)
                    break;
                pred = pred->parent;
                move = left child ?
                    MOVE_UP_FROM_LEFT :
                    MOVE_UP_FROM_RIGHT;
                continue;
            }
            pred = pred->right;
            move = MOVE_DOWN;
            continue;

        case MOVE_UP_FROM_RIGHT:
            if (pred == root)
                break;
            pred = pred->parent;
            move = left child ? MOVE_UP_FROM_LEFT : MOVE_UP_FROM_RIGHT;
            continue;
        }
        done = 1;
    } while (!done);

    This way there's no strict limit to how many preds we allow,
    and it will also short circuit the logical operations when possible.

    Cc: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • When a filter is disabled, free the preds.

    Cc: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Currently we allocate an array of pointers to filter_preds, and then
    allocate a separate filter_pred for each item in the array.
    This adds slight overhead in the filters as it needs to dereference
    twice to get to the op condition.

    Allocating the preds themselves in a single array removes a dereference
    as well as helping the cache footprint.
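
    The allocation change, in sketch form:

    /* before: an array of pointers plus one allocation per pred,
     * costing a double dereference per condition */
    filter->preds = kcalloc(n, sizeof(struct filter_pred *), GFP_KERNEL);
    for (i = 0; i < n; i++)
            filter->preds[i] = kzalloc(sizeof(struct filter_pred),
                                       GFP_KERNEL);

    /* after: one contiguous array, one dereference, better locality */
    filter->preds = kcalloc(n, sizeof(struct filter_pred), GFP_KERNEL);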

    Cc: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • By separating out the resetting of filter->n_preds to zero from
    the reallocation of preds for the filter, we can reset groups of
    filters first, call synchronize_sched() just once, and then reallocate
    each of the filters in the system group.

    Cc: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • For every filter that is made, we create predicates to hold every
    operation within the filter. We have a max of 32 predicates that we
    can hold. Currently, we allocate all 32 even if we only need to
    use one.

    Part of the reason we do this is that the filter can be used at
    any moment by any event. Fortunately, the filter is only used
    with preemption disabled. By resetting the count of used preds
    ("n_preds") to zero and then performing a synchronize_sched(), we can
    safely free and reallocate a new array of preds.
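
    The resulting lifecycle, as a sketch (the allocation helper name is
    an assumption):

    filter->n_preds = 0;     /* readers now see an empty filter       */
    synchronize_sched();     /* wait out preempt-disabled readers     */
    kfree(filter->preds);    /* safe: nobody can be walking the preds */
    filter->preds = alloc_preds(count);   /* sized for the new filter */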

    Cc: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • The ops OR and AND act differently from the other ops, as they
    are the only ones to take other ops as their arguments.
    These ops also change the logic of filter_match_preds.

    By removing the OR and AND fn's we can also remove the val1 and val2
    arguments that are passed to all other fn's and are unused.
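
    The predicate function type shrinks accordingly (sketch; treat the
    typedef name as an assumption):

    /* before: every fn carried val1/val2 for the OR/AND case */
    typedef int (*filter_pred_fn_t)(struct filter_pred *pred, void *event,
                                    int val1, int val2);

    /* after: OR/AND are handled by the match loop itself */
    typedef int (*filter_pred_fn_t)(struct filter_pred *pred, void *event);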

    Cc: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • The n_preds field of a filter can change at any time, and can even
    become zero, just as the filter is about to be processed by an event.
    In the case that it is zero on entering the filter, return 1, telling
    the caller the event matches and should be traced.

    Also use a variable and assign it with ACCESS_ONCE() such that the
    count stays consistent within the function.
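
    The entry check then looks like this (sketch):

    int filter_match_preds(struct event_filter *filter, void *rec)
    {
            /* read the count once so it stays consistent here */
            int n_preds = ACCESS_ONCE(filter->n_preds);

            /* no filter is considered a match */
            if (!n_preds)
                    return 1;
            ...
    }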

    Cc: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

29 Jun, 2010

2 commits


18 May, 2010

1 commit


15 May, 2010

3 commits

  • The filter_active and enable fields both use an int (4 bytes each) to
    set a single flag. We can save 4 bytes per event by combining the
    two into a single integer.

    text     data     bss     dec      hex     filename
    4913961  1088356  861512  6863829  68bbd5  vmlinux.orig
    4894944  1018052  861512  6774508  675eec  vmlinux.id
    4894871  1012292  861512  6768675  674823  vmlinux.flags

    This gives us another 5K in savings.

    The modifications of both the enable and filter fields are done
    under the event_mutex, so it is still safe to combine the two.
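
    The combined field is a flags word with one bit per former int,
    roughly (the bit names are assumptions modeled on the tracing code
    of this era):

    enum {
            TRACE_EVENT_FL_ENABLED_BIT,
            TRACE_EVENT_FL_FILTERED_BIT,
    };

    enum {
            TRACE_EVENT_FL_ENABLED  = (1 << TRACE_EVENT_FL_ENABLED_BIT),
            TRACE_EVENT_FL_FILTERED = (1 << TRACE_EVENT_FL_FILTERED_BIT),
    };

    struct ftrace_event_call {
            ...
            unsigned int flags;  /* replaces 'enabled' and 'filter_active' */
    };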

    Note: Although Mathieu gave his Acked-by, he would like it documented
    that the reads of flags are not protected by the mutex. The way the
    code works, these reads will not break anything, but will have a
    residual effect. Since this behavior is the same even before this
    patch, describing this situation is left to another patch, as this
    patch does not change the behavior, but just brought it to Mathieu's
    attention.

    v2: Updated the event trace self test for this change.

    Acked-by: Mathieu Desnoyers
    Acked-by: Masami Hiramatsu
    Acked-by: Frederic Weisbecker
    Cc: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Now that the trace_event structure is embedded in the ftrace_event_call
    structure, there is no need for the ftrace_event_call id field.
    The id field is the same as the trace_event type field.

    Removing the id and re-arranging the structure brings down the tracepoint
    footprint by another 5K.

    text     data     bss     dec      hex     filename
    4913961  1088356  861512  6863829  68bbd5  vmlinux.orig
    4895024  1023812  861512  6780348  6775bc  vmlinux.print
    4894944  1018052  861512  6774508  675eec  vmlinux.id

    Acked-by: Mathieu Desnoyers
    Acked-by: Masami Hiramatsu
    Acked-by: Frederic Weisbecker
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Move the defined fields from the event to the class structure.
    Since the fields of the event are defined by the class they belong
    to, it makes sense to have the class hold the information instead
    of the individual events. The events of the same class would just
    hold duplicate information.

    After this change the size of the kernel dropped another 3K:

    text     data     bss     dec      hex     filename
    4913961  1088356  861512  6863829  68bbd5  vmlinux.orig
    4900252  1057412  861512  6819176  680d68  vmlinux.regs
    4900375  1053380  861512  6815267  67fe23  vmlinux.fields

    Although the text increased, this was mainly due to the C files
    having to adapt to the change. This is a constant increase, where
    new tracepoints will not increase the text. But the big drop is
    in the data size (as well as in the allocations needed to hold the
    fields). This will give even more savings as more tracepoints are
    created.

    Note, if just TRACE_EVENT()s are used and not DECLARE_EVENT_CLASS()
    with several DEFINE_EVENT()s, then the savings will be lost. But
    we are pushing developers to consolidate events with DEFINE_EVENT()
    so this should not be an issue.

    The kprobes define a unique class for every new event, but are dynamic
    so it should not be an issue.

    The syscalls, however, have a single class, but the fields for the
    individual events are different. The syscalls use metadata to define
    the fields. I moved the fields list from the event to the metadata and
    added a "get_fields()" function to the class. This function is used
    to find the fields. For normal events and kprobes, get_fields() just
    returns a pointer to the fields list_head in the class. For syscall
    events, it returns the fields list_head in the metadata for the event.

    v2: Fixed the syscall fields. The syscall metadata needs a list
    of fields for both enter and exit.
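
    The get_fields() dispatch described above reduces to a small helper
    (a sketch mirroring the description, not a quote of the patch):

    struct list_head *trace_get_fields(struct ftrace_event_call *event_call)
    {
            /* normal events and kprobes: fields live in the class */
            if (!event_call->class->get_fields)
                    return &event_call->class->fields;
            /* syscall events: fields live in the per-event metadata */
            return event_call->class->get_fields(event_call);
    }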

    Acked-by: Frederic Weisbecker
    Acked-by: Mathieu Desnoyers
    Acked-by: Masami Hiramatsu
    Cc: Tom Zanussi
    Cc: Peter Zijlstra
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

14 May, 2010

1 commit

  • This patch creates an ftrace_event_class struct that event structs point
    to. This class struct will be made to hold information to modify the
    events. Currently the class struct only holds the event's system name.

    This patch slightly increases the size, but this change lays the
    groundwork for other changes to make the footprint of tracepoints smaller.
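
    At this stage the class is tiny, roughly (a sketch of the initial shape):

    struct ftrace_event_class {
            const char *system;   /* the event system's name */
    };

    struct ftrace_event_call {
            ...
            struct ftrace_event_class *class;
    };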

    With 82 standard tracepoints, and 618 system call tracepoints
    (two tracepoints per syscall: enter and exit):

    text     data     bss     dec      hex     filename
    4913961  1088356  861512  6863829  68bbd5  vmlinux.orig
    4914025  1088868  861512  6864405  68be15  vmlinux.class

    This patch also cleans up some stale comments in ftrace.h.

    v2: Fixed missing semi-colon in macro.

    Acked-by: Frederic Weisbecker
    Acked-by: Mathieu Desnoyers
    Acked-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

07 May, 2010

1 commit


30 Mar, 2010

1 commit

  • include cleanup: Update gfp.h and slab.h includes to prepare for
    breaking implicit slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py
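
    In the common case the edit just makes a hidden dependency explicit,
    e.g. (illustrative):

    /* before: slab.h arrived implicitly via percpu.h */
    #include <linux/percpu.h>

    /* after: include what you use */
    #include <linux/slab.h>
    #include <linux/percpu.h>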

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, while adding it to an implementation .h
    or embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed,
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs, requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as a bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo